Hello,
I would like to use the Davies-Bouldin Index to determine the optimal number of clusters in a cluster analysis I am doing.
Does anyone know if there is a command that implements this index in Stata? Or does code exist somewhere that can be used in Stata?
I know there’s a command that calculates the Calinski–Harabasz pseudo-F stopping-rule index, and I have already used it. But I’m trying to reproduce the methodology of an article that uses the Davies-Bouldin Index.
Here are the steps to follow:
1) Build a separate model for each number of clusters, k = 2 to 15:
forvalues i = 2/15 {
    cluster kmeans varlist, k(`i') name(cluster_`i')
}
* My varlist is composed of 20 variables.
2) For each model, calculate the following validity index and choose the k that minimizes it:
Validity_`i' = Intra_`i' / Inter_`i'
Validity is the ratio of the sum of within-cluster scatter to between-cluster separation. The objective is to minimize this measure, since a good partition has low within-cluster scatter and high between-cluster separation.
I am also working with survey data, so it would be ideal if the calculation could take the survey weights into account.
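For reference, this is the Davies-Bouldin index as I understand its usual definition, written out so the quantities are explicit. The notation (S_i, M_ij, c_i, w_x) is my own and the article may measure scatter and separation differently. The weighted form of S_i at the end is only my assumption about how survey weights could enter, not something taken from the article:

```latex
% k clusters C_1, ..., C_k with centroids c_1, ..., c_k
% S_i  : within-cluster scatter of cluster i (mean distance to its centroid)
% M_ij : between-cluster separation (distance between centroids i and j)
\[
S_i = \frac{1}{|C_i|} \sum_{x \in C_i} \lVert x - c_i \rVert ,
\qquad
M_{ij} = \lVert c_i - c_j \rVert
\]

% Davies-Bouldin index for a k-cluster solution: for each cluster take the
% worst (largest) scatter-to-separation ratio, then average over clusters
\[
\mathrm{DB}(k) = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{S_i + S_j}{M_{ij}}
\]

% ASSUMPTION: one possible weighted scatter, with w_x the survey weight of
% observation x; the centroids c_i would then be weighted means as well
\[
S_i^{(w)} = \frac{\sum_{x \in C_i} w_x \, \lVert x - c_i \rVert}{\sum_{x \in C_i} w_x}
\]
```

Under this definition, the preferred number of clusters would be the k in 2 to 15 with the smallest DB(k).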
Thanks for your help!
Alexandre Parent
Employment and Social Development Canada