Varlist not recognised by foreach

Saunok Chakrabarty

Join Date: Aug 2019
Posts: 43


05 Feb 2024, 23:28

I am trying to determine the optimal number of clusters to use in a kmeans clustering problem. I am clustering on three variables: vstd_schl_cls, nstd_schl_cls, size_std_cls (standardized values of verbal scores, numeric scores, and class sizes, respectively). I have run -cluster kmeans- once for each number of clusters k from 1 to 20.

Code:

local list2 "vstd_schl_cls nstd_schl_cls size_std_cls"
forvalues k = 1(1)20 {
    cluster kmeans `list2', k(`k') start(random(123)) name(cs`k') mea(abs) keepcen
}

A sample of my data is below:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input double(vstd_schl_cls nstd_schl_cls size_std_cls) byte cs1 double(cs2 cs3 cs4 cs5)
 .5896847248077393 .08571480214595795 1.014149785041809 1 2 3 3 4
 1.527132511138916 1.0911247730255127 1.014149785041809 1 2 3 4 4
.27720215916633606 1.6656447649002075 1.014149785041809 1 2 2 4 2
 .9021673202514648 1.6656447649002075 1.014149785041809 1 1 1 1 1
 .9021673202514648  .9474948048591614 1.014149785041809 1 1 2 2 1
end

When I try to find the optimal number of clusters via ANOVA for each variable and stored cluster, I get an error: "varlist not allowed" . This is my code:

Code:

matrix WSS = J(20,5,.)
matrix colnames WSS = k WSS log(WSS) eta-squared PRE
* WSS for each clustering
set trace on
forvalues k = 1(1)20 {
    scalar ws`k´ = 0
    foreach v of list2 {
        quietly anova `v´ cs`k´
        scalar ws`k´ = ws`k´ + e(rss)
    }
    matrix WSS[`k´, 1] = `k´
    matrix WSS[`k´, 2] = ws`k´
    matrix WSS[`k´, 3] = log(ws`k´)
    matrix WSS[`k´, 4] = 1 - ws`k´/WSS[1,2]
    matrix WSS[`k´, 5] = (WSS[`k´-1,2] - ws`k´)/WSS[`k´-1,2]
}

Why am I getting this error? I have also tried "foreach v of local list2", "foreach v of local `list2'", and "foreach v of varlist `list2'".

Is my syntax wrong?


Joseph Coveney

Join Date: Apr 2014

Posts: 4354
#2

06 Feb 2024, 00:58

Originally posted by Saunok Chakrabarty

Is my syntax wrong?

Well, yeah, foreach v of list2 is wrong; two of your three alternatives in the penultimate sentence of your post should work, though.
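To make the distinction concrete, here is a minimal sketch of the foreach forms that Stata should accept. It uses the auto dataset that ships with Stata and a stand-in macro, so price, mpg and weight are assumptions for illustration, not your variables.

```stata
* Self-contained sketch: auto.dta and its variable names are stand-ins
sysuse auto, clear
local list2 "price mpg weight"

* Form 1: -of local- takes the macro NAME, with no quotes around it
foreach v of local list2 {
    display "of local:   `v'"
}

* Form 2: -of varlist- takes the macro CONTENTS, so the macro is expanded
foreach v of varlist `list2' {
    display "of varlist: `v'"
}

* Form 3: -in- takes any list of words, again the expanded contents
foreach v in `list2' {
    display "in:         `v'"
}
```

The remaining alternative, -of local- followed by the quoted macro, fails because the macro is expanded first: foreach then reads the first variable name as a macro name and is left with the other names as invalid syntax.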

But that's not where your error message is coming from. It's coming from this line.

Code:

scalar ws`k´ = 0

Try this instead.

Code:

scalar ws`k' = 0

You'll get an error for all of the other instances of `k´ later in your code, too, and so you might as well correct them all while you're at it.
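For reference, a minimal sketch of the whole loop with the macro quotes corrected: each reference opens with a left single quote (backtick) and closes with a plain apostrophe, never an acute accent. I have not run this on your data; the simulated observations below are an assumption so that the block runs on its own, and I have left out keepcen and mea(abs) to keep the sketch short.

```stata
* Simulated stand-in data; only the three variable names follow the thread
clear
set seed 12345
set obs 500
local list2 "vstd_schl_cls nstd_schl_cls size_std_cls"
foreach v of local list2 {
    generate double `v' = rnormal()
}

* One k-means solution per k, stored as cs1, cs2, ..., cs20
forvalues k = 1/20 {
    cluster kmeans `list2', k(`k') start(random(123)) name(cs`k')
}

matrix WSS = J(20, 5, .)
matrix colnames WSS = k WSS log(WSS) eta-squared PRE

* Within sum of squares for each clustering, summed over the variables
forvalues k = 1/20 {
    scalar ws`k' = 0
    foreach v of local list2 {
        quietly anova `v' cs`k'
        scalar ws`k' = ws`k' + e(rss)
    }
    matrix WSS[`k', 1] = `k'
    matrix WSS[`k', 2] = ws`k'
    matrix WSS[`k', 3] = log(ws`k')
    matrix WSS[`k', 4] = 1 - ws`k'/WSS[1,2]
    matrix WSS[`k', 5] = (WSS[`k'-1,2] - ws`k')/WSS[`k'-1,2]
}
matrix list WSS
```

One more thing to watch: a local macro exists only within the do-file or interactive session that defines it. If you define list2 in one run and execute the loop in another (for example, by running selected lines from the Do-file Editor), the macro will be empty by the time foreach sees it.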

Is it common to use ANOVA in this way in k-means clustering?
Saunok Chakrabarty

Join Date: Aug 2019

Posts: 43
#3

06 Feb 2024, 14:03

Joseph, thanks for your answer. I had overlooked the quotes around the macros and scalars.

I'm not sure if this is a common approach. I was trying to determine if there was an optimal cluster size, depending on the data. This paper in The Stata Journal (https://journals.sagepub.com/doi/pdf...867X1201200213) recommended this approach. I suspect the usefulness of this approach will depend on the data. In my case, after making the correction, it did not turn out to be especially useful.

Regards,
Saunok
Daniel Schaefer

Join Date: Mar 2020

Posts: 806
#4

06 Feb 2024, 16:30

You might consider DBSCAN as an alternative to kmeans. Kmeans isn't even guaranteed to give the same set of clusters on a given value of k without setting the seed, so who knows whether you've found the optimal solution?
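A quick way to see the dependence on starting values is to run the same k from two different random starts and cross-tabulate the memberships. This is only a sketch on simulated data: x, y, runA and runB are hypothetical names, and how much the two solutions differ will depend on the data.

```stata
* Simulated data with no built-in cluster structure (an assumption)
clear
set seed 2024
set obs 300
generate x = rnormal()
generate y = rnormal()

* Same k, two different seeds for the random starting centers
cluster kmeans x y, k(3) start(random(1)) name(runA)
cluster kmeans x y, k(3) start(random(2)) name(runB)

* Group numbers are arbitrary labels, so compare the partitions with a
* cross-tabulation rather than testing runA == runB
tabulate runA runB
```

If the two runs found the same partition, each row of the table has a single nonzero cell; counts spread across a row mean the solutions disagree.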
Joseph Coveney

Join Date: Apr 2014

Posts: 4354
#5

06 Feb 2024, 20:17

Originally posted by Saunok Chakrabarty

. . . This paper in The Stata Journal . . . recommended this approach. I suspect the usefulness of this approach will depend on the data. In my case, after making the correction, it did not turn out to be especially useful.

OK, thanks for the follow-up. I wasn't aware of any of that.
Saunok Chakrabarty

Join Date: Aug 2019

Posts: 43
#6

07 Feb 2024, 11:33

Hi Daniel,

I haven't tried out DBSCAN -- I didn't know what that was. But I will try to look into it, since it seems to not require a prior number of clusters. Thanks!

Regards,
Saunok
Nick Cox

Join Date: Mar 2014

Posts: 35213
#7

07 Feb 2024, 12:09

Back in 2011 I wrote

What has been called the classification crunch amounts to this: If you have well-distinguished clusters, some simple sensible graphical method will show you what they are. If you don't, you should lay in supplies for endless experimentation with how cluster dissimilarity is defined, how observations or clusters should be grouped into larger clusters, and so forth.

and my prejudices survive unscathed. Positively, I suggest firing up

Code:

graph matrix vstd_schl_cls nstd_schl_cls size_std_cls

and also a quick PCA

Code:

pca vstd_schl_cls nstd_schl_cls size_std_cls
predict score1 score2
scatter score1 score2

If you can see clearly defined clusters in such plots, well and good. Naturally, neither approach offers a formal method to identify clusters, but just looking at the data is crucial too.
Emmanuel Alengo

Join Date: Oct 2023

Posts: 1
#8

08 Feb 2024, 02:46

I want to generate a variable that counts the number of "Yes" a respondent provided for a given set of questions. How do I go about it?
Nick Cox

Join Date: Mar 2014

Posts: 35213
#9

08 Feb 2024, 02:53

#8 as worded here has no bearing on the thread title or topic. Please start a new thread with an informative title.
Saunok Chakrabarty

Join Date: Aug 2019

Posts: 43
#10

08 Feb 2024, 20:25

Hi Nick,

Thanks for the suggestion. I'll inspect the data once more graphically, to see if they indicate any clue towards well-defined clusters. The PCA didn't come to my mind - thanks!

Regards,
Saunok
