  • Varlist not recognised by foreach

    I am trying to determine the optimal number of clusters to use in a kmeans clustering problem. I am clustering on three variables: vstd_schl_cls, nstd_schl_cls, size_std_cls (standardized values of verbal scores, numeric scores, and class sizes, respectively). I have run -cluster kmeans- for each value of k from 1 to 20.

    Code:
    local list2 "vstd_schl_cls nstd_schl_cls size_std_cls"
    forvalues k = 1(1)20 {
        cluster kmeans `list2', k(`k') start(random(123)) name(cs`k') mea(abs) keepcen
    }
    A sample of my data is below:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input double(vstd_schl_cls nstd_schl_cls size_std_cls) byte cs1 double(cs2 cs3 cs4 cs5)
     .5896847248077393 .08571480214595795 1.014149785041809 1 2 3 3 4
     1.527132511138916 1.0911247730255127 1.014149785041809 1 2 3 4 4
    .27720215916633606 1.6656447649002075 1.014149785041809 1 2 2 4 2
     .9021673202514648 1.6656447649002075 1.014149785041809 1 1 1 1 1
     .9021673202514648  .9474948048591614 1.014149785041809 1 1 2 2 1
    end
    When I try to find the optimal number of clusters via ANOVA for each variable and stored cluster, I get an error: "varlist not allowed". This is my code:

    Code:
    matrix WSS = J(20,5,.)
    matrix colnames WSS = k WSS log(WSS) eta-squared PRE
     * WSS for each clustering
    set trace on
     forvalues k = 1(1)20 {
        scalar ws`k´ = 0
        foreach v of list2 {
            quietly anova `v´ cs`k´
            scalar ws`k´ = ws`k´ + e(rss)
     }
        matrix WSS[`k´, 1] = `k´
        matrix WSS[`k´, 2] = ws`k´
        matrix WSS[`k´, 3] = log(ws`k´)
        matrix WSS[`k´, 4] = 1 - ws`k´/WSS[1,2]
        matrix WSS[`k´, 5] = (WSS[`k´-1,2] - ws`k´)/WSS[`k´-1,2]
    }
    Why am I getting this error? I have also tried "foreach v of local list2", "foreach v of local `list2'", and "foreach v of varlist `list2'".

    Is my syntax wrong?

  • #2
    Originally posted by Saunok Chakrabarty
    Is my syntax wrong?
    Well, yeah, foreach v of list2 is wrong; two of your three alternatives in the penultimate sentence of your post should work, though.

    But that's not where your error message is coming from. It's coming from this line.
    Code:
    scalar ws`k´ = 0
    Try this instead.
    Code:
    scalar ws`k' = 0
    You'll get an error for all of the other instances of `k´ later in your code, too, and so you might as well correct them all while you're at it.
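
    For reference, a minimal sketch of the whole loop with both fixes applied: every macro reference closed with a straight apostrophe (`k' rather than `k´), and the inner loop written as foreach v of local list2 (foreach v of varlist `list2' should work as well). It assumes the three variables and the grouping variables cs1-cs20 from the -cluster kmeans- loop in #1 are already in memory. The local list2 is defined again here because locals do not persist from one do-file run to the next.

    ```stata
    * Assumes cs1-cs20 already exist from the -cluster kmeans- loop in #1
    local list2 "vstd_schl_cls nstd_schl_cls size_std_cls"

    matrix WSS = J(20, 5, .)
    matrix colnames WSS = k WSS log(WSS) eta-squared PRE

    * Within sum of squares for each clustering
    forvalues k = 1(1)20 {
        scalar ws`k' = 0
        foreach v of local list2 {
            * One-way ANOVA of each variable on the k-cluster grouping
            quietly anova `v' cs`k'
            scalar ws`k' = ws`k' + e(rss)
        }
        matrix WSS[`k', 1] = `k'
        matrix WSS[`k', 2] = ws`k'
        matrix WSS[`k', 3] = log(ws`k')
        matrix WSS[`k', 4] = 1 - ws`k'/WSS[1, 2]
        matrix WSS[`k', 5] = (WSS[`k'-1, 2] - ws`k')/WSS[`k'-1, 2]
    }

    matrix list WSS
    ```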

    Is it common to use ANOVA in this way in k-means clustering?


    • #3
      Joseph, thanks for your answer. I had overlooked the quotes around the macros and scalars.

      I'm not sure if this is a common approach. I was trying to determine whether there was an optimal number of clusters, depending on the data. This paper in The Stata Journal (https://journals.sagepub.com/doi/pdf...867X1201200213) recommended this approach. I suspect the usefulness of this approach will depend on the data. In my case, after making the correction, it did not turn out to be especially useful.

      Regards,
      Saunok


      • #4
        You might consider DBSCAN as an alternative to kmeans. Kmeans isn't even guaranteed to give the same set of clusters for a given value of k unless you set the seed, so who knows whether you've found the optimal solution?
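
        A minimal sketch of the seed point, using the auto dataset shipped with Stata rather than the data from #1 (the variables and the names run1 and run2 are purely illustrative). With the same seed in start(random(#)), as in the code in #1, two runs should pick the same starting centers and so produce the same grouping.

        ```stata
        * Illustrative data only: the auto dataset shipped with Stata
        sysuse auto, clear

        * Two runs with the same seed for the random starting centers
        cluster kmeans price mpg weight, k(3) start(random(123)) name(run1)
        cluster kmeans price mpg weight, k(3) start(random(123)) name(run2)

        * The grouping variables take the names given in name()
        assert run1 == run2
        ```

        A different seed may or may not give the same partition, which is the concern raised here.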


        • #5
          Originally posted by Saunok Chakrabarty
          . . . This paper in The Stata Journal . . . recommended this approach. I suspect the usefulness of this approach will depend on the data. In my case, after making the correction, it did not turn out to be especially useful.
          OK, thanks for the follow-up. I wasn't aware of any of that.


          • #6
            Hi Daniel,

            I haven't tried out DBSCAN; I didn't know what it was. But I will look into it, since it does not seem to require specifying the number of clusters in advance. Thanks!

            Regards,
            Saunok


            • #7
              Back in 2011 I wrote

              What has been called the classification crunch amounts to this: If you have well-distinguished clusters, some simple sensible graphical method will show you what they are. If you don't, you should lay in supplies for endless experimentation with how cluster dissimilarity is defined, how observations or clusters should be grouped into larger clusters, and so forth.
              and my prejudices survive unscathed. Positively, I suggest firing up

              Code:
              graph matrix vstd_schl_cls nstd_schl_cls size_std_cls
              and also a quick PCA

              Code:
              pca vstd_schl_cls nstd_schl_cls size_std_cls
              
              predict score1 score2
              
              scatter score1 score2
              If you can see clearly defined clusters in such plots, well and good. Naturally, neither approach offers a formal method to identify clusters, but just looking at the data is crucial too.


              • #8
                I want to generate a variable that counts the number of "Yes" a respondent provided for a given set of questions. How do I go about it?


                • #9
                  #8 as worded here has no bearing on the thread title or topic. Please start a new thread with an informative title.


                  • #10
                    Hi Nick,

                    Thanks for the suggestion. I'll inspect the data graphically once more, to see whether there is any sign of well-defined clusters. PCA hadn't occurred to me; thanks!

                    Regards,
                    Saunok
