Clustering variables instead of observations

Martinо Cоmelli

Join Date: Jul 2014
Posts: 36

12 Jun 2024, 14:35

I would like to cluster categorical variables instead of observations, and then build a dendrogram from them. The documentation of "cluster" says:

Clustering variables instead of observations
Sometimes you want to cluster variables rather than observations, so you can use the cluster
command. One approach to clustering variables in Stata is to use xpose (see [D] xpose) to transpose
the variables and observations and then to use cluster. Another approach is to use the matrix
dissimilarity command with the variables option (see [MV] matrix dissimilarity) to produce
a dissimilarity matrix for the variables. This matrix is then passed to clustermat to obtain the
hierarchical clustering. See [MV] clustermat.

And then in the documentation of cluster linkage there is an example that looks like my case, Example 3.

[Attached image: 2024-06-12_21-28-34.png]

Now I need to do the same with my dataset.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte(S2_KAT S4 Q27A MENTALITY CITY EU NATO COALITION LEFTRIGHT MONEY CLASS END_MONTH)
6 4 2 0 1 2 1 2 1 3 1 .
3 2 1 1 1 1 1 1 1 2 2 1
6 2 1 1 1 2 2 1 2 3 1 .
3 4 . 1 2 2 2 1 . 4 1 2
3 4 2 0 1 2 . 2 . 4 1 .
6 3 2 0 2 . . . 2 1 2 .
5 3 . 0 1 2 2 2 2 2 2 1
5 4 2 0 2 . . 2 2 4 1 2
5 2 1 . 1 2 2 . . 3 1 1
4 3 1 1 2 2 2 1 2 3 2 .
3 3 1 1 1 2 2 1 1 4 2 2
4 4 1 1 2 2 2 1 1 4 1 .
6 4 1 1 2 2 2 1 2 3 1 .
4 2 2 . 1 2 2 2 2 2 2 1
5 2 . 1 1 . . . . 2 1 1
4 3 . 0 1 1 1 . 1 2 1 1
5 2 2 . 1 1 1 . 2 1 2 .
6 4 2 0 1 . . 1 2 1 1 1
3 3 1 1 3 . 2 1 1 2 2 2
6 2 2 . 1 2 2 1 2 1 1 .
6 4 2 . 2 2 1 2 1 3 1 2
6 2 2 0 2 1 1 2 1 1 1 1
6 2 2 . 3 2 . 1 . 2 . 2
3 4 2 . 1 2 2 . 1 3 1 .
5 2 2 . 1 2 . . . . 2 .
2 4 2 . 1 2 2 1 2 3 1 .
3 4 1 1 2 2 2 1 1 4 1 2
2 3 2 . 1 1 1 1 1 1 1 1
3 4 1 1 1 2 2 1 2 4 1 .
5 2 . 1 1 1 1 . 2 1 2 1
6 3 2 0 2 2 . 2 1 3 2 1
2 2 2 0 2 2 2 2 1 3 2 .
6 2 2 0 3 1 1 1 . 2 2 2
3 4 1 1 2 2 2 1 . 4 1 2
3 3 2 1 1 1 1 1 . 2 2 .
2 4 . 0 2 2 2 . . 3 1 .
2 3 2 1 1 2 2 1 . 4 1 .
5 2 1 1 1 2 2 1 . 4 2 .
5 4 1 1 2 2 2 1 . 3 1 .
6 2 1 1 1 2 2 1 2 2 1 2
2 3 2 . 3 . . 2 1 2 2 .
2 4 1 1 1 . . 1 2 2 1 1
5 3 1 1 2 2 2 1 2 1 2 2
5 3 1 1 1 2 2 1 2 3 1 .
3 4 1 1 2 2 2 1 2 4 1 2
3 3 1 1 1 2 2 . 1 3 2 1
2 4 1 0 3 2 2 . 1 4 1 .
6 3 2 0 1 . 2 1 1 1 1 .
2 3 1 . 1 . . 1 . 1 1 .
5 3 1 1 3 2 2 1 2 1 1 1
6 2 1 . 2 2 2 1 . 2 2 .
5 2 2 . 2 2 2 . . 3 2 .
5 2 2 1 1 2 1 2 1 1 1 1
4 3 1 1 3 2 2 1 . 1 1 1
3 3 2 1 2 2 2 1 1 1 1 1
6 3 2 0 2 2 2 2 1 1 2 .
3 4 2 1 2 2 2 1 . 3 1 .
6 2 2 0 2 . 1 2 1 1 2 1
6 3 2 1 2 2 1 2 1 3 1 1
5 3 2 0 1 2 2 2 2 2 1 1
4 3 1 1 3 2 2 1 2 2 . .
6 3 1 . 3 2 2 1 2 1 1 1
3 2 2 1 2 2 2 2 . 2 1 1
6 3 1 . 1 2 2 1 2 1 2 .
3 4 1 1 3 2 2 1 2 4 1 2
2 4 . 1 2 2 2 . 1 1 1 1
5 2 . 1 1 2 1 . . 1 2 1
3 2 1 . 1 2 2 1 . 4 1 2
2 3 1 1 1 . . 1 2 3 1 1
5 3 2 0 1 2 2 2 1 1 1 1
6 3 1 . 1 2 2 1 2 1 2 .
3 4 . 0 3 . . 1 . 1 1 1
2 3 1 1 2 2 2 1 . 4 1 2
2 2 1 . 1 2 2 1 2 1 2 .
6 3 2 0 2 . . 2 1 3 1 .
4 4 1 . 3 2 2 1 2 2 1 .
6 4 1 1 3 2 2 1 . 1 1 2
3 4 1 1 1 2 2 1 . 3 1 .
2 4 . . 1 2 . . . 1 1 .
3 2 . . 2 1 1 . 1 1 . 2
6 3 1 0 2 2 2 . . 3 1 1
5 4 2 . 2 . . . . 2 1 .
6 4 1 . 1 2 2 1 2 2 1 .
3 4 2 . 1 . 1 1 . 4 1 2
5 3 1 . 3 1 1 1 2 1 2 1
5 4 1 1 2 2 1 2 . 1 2 .
6 2 2 . 2 . . 2 1 3 1 1
5 3 2 0 2 2 2 2 2 4 1 2
6 3 1 1 3 2 2 1 1 3 1 .
3 4 2 1 2 2 2 1 2 2 2 1
3 4 1 1 1 2 2 1 . 2 1 1
3 2 1 . 3 1 1 2 1 1 2 1
6 3 . . 2 2 2 . 2 . 2 .
6 4 1 1 3 2 2 1 2 3 1 2
6 3 2 1 2 1 1 2 2 1 2 1
5 4 1 1 1 2 2 1 2 4 1 2
6 3 2 1 3 2 2 2 1 1 2 1
2 2 2 . 3 1 1 1 1 1 2 1
2 3 . 1 2 . . 1 . 1 2 1
3 4 1 1 2 2 2 1 2 4 2 .
end
label values S2_KAT labels1
label def labels1 2 "18-29", modify
label def labels1 3 "30-39", modify
label def labels1 4 "40-49", modify
label def labels1 5 "50-59", modify
label def labels1 6 "60 a viac", modify
label values S4 labels3
label def labels3 2 "bez_maturity", modify
label def labels3 3 "maturita", modify
label def labels3 4 "vysokoskolske", modify
label values Q27A labels49
label def labels49 1 "Ivan Korčok", modify
label def labels49 2 "Peter Pellegrini", modify
label values MENTALITY MENTALITY
label def MENTALITY 0 "SK not backward", modify
label def MENTALITY 1 "SK backward", modify
label values CITY CITY
label def CITY 1 "Village (up to 4999)", modify
label def CITY 2 "City (from 5000)", modify
label def CITY 3 "more than 100000", modify
label values EU EU
label def EU 1 "NO EU", modify
label def EU 2 "PRO EU", modify
label values NATO NATO
label def NATO 1 "NO NATO", modify
label def NATO 2 "PRO NATO", modify
label values COALITION COALITION
label def COALITION 1 "Opposition", modify
label def COALITION 2 "In Power", modify
label values LEFTRIGHT LEFTRIGHT
label def LEFTRIGHT 1 "Left", modify
label def LEFTRIGHT 2 "right", modify
label values MONEY MONEY
label def MONEY 1 "up to 1050 eur", modify
label def MONEY 2 "1 051 € to 1 400 €", modify
label def MONEY 3 "1 401 € to 2 500 €", modify
label def MONEY 4 "more than 2501 eur", modify
label values CLASS CLASS
label def CLASS 1 "White collars", modify
label def CLASS 2 "Workers", modify
label values END_MONTH END_MONTH
label def END_MONTH 1 "Difficult", modify
label def END_MONTH 2 "Easy", modify

I have tried to follow the example above, but I get stuck every time. Can anyone suggest the correct syntax?

Thank you for your attention!


Martinо Cоmelli

Join Date: Jul 2014

Posts: 36
#2

13 Jun 2024, 02:54

UPDATE: I am not sure that the strategy in Example 3 is the best one for me. I would like to perform clustering on categorical variables with 31 categories in total (as in the sample dataset posted above). However, when I follow the example in Stata, my dendrogram has more than 200 leaves, which must be a mistake. Can somebody help?
Martinо Cоmelli

Join Date: Jul 2014

Posts: 36
#3

13 Jun 2024, 04:55

This is what I need to do: a dendrogram with each variable's values. How can I do it with Stata?
Nick Cox

Join Date: Mar 2014

Posts: 35211
#4

14 Jun 2024, 04:55

You need to be clear first that you have a way to judge similarity or dissimilarity between variables on quite different scales. On the face of it some are nominal and some are ordinal.
Martinо Cоmelli

Join Date: Jul 2014

Posts: 36
#5

15 Jun 2024, 08:27

Originally posted by Nick Cox:

You need to be clear first that you have a way to judge similarity or dissimilarity between variables on quite different scales. On the face of it some are nominal and some are ordinal.

Thank you for taking a look! I could just use the nominal variables to make it easier. I think there is just one ordinal variable, which I can exclude; the rest are either dummies or categorical/nominal.
Do you have an idea how I can compute this? Following the example above, I am not getting very far.
Martinо Cоmelli

Join Date: Jul 2014

Posts: 36
#6

15 Jun 2024, 08:47

For sure I can proceed by splitting each variable into, let's say, MENTALITY_0 and MENTALITY_1, etc., but then what?
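The splitting step described here can be pictured with a small sketch. It is only an illustration in plain Python, not Stata syntax: it uses the first eight observations of MENTALITY and CITY from the -dataex- listing in #1 (none of them missing), and the helper name `one_hot` and the `VARIABLE_value` naming are assumptions made for the example.

```python
def one_hot(name, values):
    """Return one 0/1 indicator column per observed category of a variable."""
    categories = sorted(set(values))
    return {f"{name}_{c}": [int(v == c) for v in values] for c in categories}

# First eight observations of MENTALITY and CITY from the listing in #1
mentality = [0, 1, 1, 1, 0, 0, 0, 0]
city = [1, 1, 1, 2, 1, 2, 1, 2]

dummies = {}
dummies.update(one_hot("MENTALITY", mentality))
dummies.update(one_hot("CITY", city))

# Each category is now its own column, one entry per respondent
for name, column in dummies.items():
    print(name, column)
```

After this step each category is a 0/1 column, so the "then what?" becomes a matter of choosing a dissimilarity between pairs of such columns and clustering the columns rather than the respondents.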
Nick Cox

Join Date: Mar 2014

Posts: 35211
#7

16 Jun 2024, 02:54

Sorry, but I don't have a clear idea of what you want to do — or a better idea of what to do. In principle, we should be interested in clusterings that expose cluster structure. In practice, with your kind of data, they are rarely helpful.

Last edited by Nick Cox; 16 Jun 2024, 03:40.
Martinо Cоmelli

Join Date: Jul 2014

Posts: 36
#8

16 Jun 2024, 08:14

Thank you for checking again!

The dendrogram above was made with my data, and it makes sense (to me at least, haha). My goal here is to replicate it in Stata, and I am not sure how.

The idea is to see how the answers to each categorical variable correlate with each other. In the example above, all the categorical variables have been split into dummies (such as END_MONTH_easy or END_MONTH_difficult), and a dendrogram has been created: we see, for example, that END_MONTH_difficult is closer to people who live in villages.

I would like to reproduce this in Stata. I tried to follow the documentation, but I am not sure what I am doing wrong.
Nick Cox

Join Date: Mar 2014

Posts: 35211
#9

16 Jun 2024, 12:57

This thread is still puzzling to me because the question keeps changing, or at least I don't follow the twists and turns of different posts.

Perhaps as extreme cases

1. You've got results in Stata but you don't believe them. #2

2. You've got results outside Stata and you want to replicate them in Stata. #8

I don't have any suggestion under either or both scenarios. You don't give any code that you used, whether inside or outside Stata.

The label definitions alone in #1 suggest 5 x 3 x 2 x 2 x 3 x 2 x 2 x 2 x 2 x 4 x 2 x 2 possible cross-combinations (I make that 46080), and even if many don't occur, a dendrogram with hundreds of leaves doesn't surprise me. (If I missed some or double counted, sorry, but I think the point is unaffected.)

This sort of problem is one of several reasons why clustering is often disappointing for categorical data, even if the variables are measured consistently. For example, even 10 binary variables means 2^10 = 1024 cross-combinations. If you're lucky a dendrogram has simple form, but that is not guaranteed.
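The arithmetic in this post can be checked with a short sketch. It assumes the number of labelled categories per variable is read correctly from the label definitions in #1, with missing values not counted as a category; the dictionary name `levels` is just for illustration.

```python
from math import prod

# Labelled categories per variable, read off the label definitions in #1
levels = {
    "S2_KAT": 5, "S4": 3, "Q27A": 2, "MENTALITY": 2, "CITY": 3, "EU": 2,
    "NATO": 2, "COALITION": 2, "LEFTRIGHT": 2, "MONEY": 4, "CLASS": 2,
    "END_MONTH": 2,
}

n_categories = sum(levels.values())  # one leaf per category
n_profiles = prod(levels.values())   # possible cross-combinations of answers

print(n_categories)  # 31
print(n_profiles)    # 46080
print(2 ** 10)       # 1024, the ten-binary-variable case
```

The sum matches the 31 "options" mentioned in #2 and the product matches the 46080 above. A dendrogram of respondents has one leaf per respondent, whereas a dendrogram of categories has one leaf per category, at most 31 here, which may explain the gap between the two pictures discussed in this thread.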

Martinо Cоmelli

Join Date: Jul 2014
Posts: 36

#10

17 Jun 2024, 03:25

Haha, I am sorry! The question, at least in my mind, is the same: my goal is literally to reproduce in Stata the dendrogram I showed above!

This is the Python code that generated the image:

Code:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist

# Load the dataset
file_path = '/Users/kome/Downloads/Alchimia/Prova_ChatGPT3.csv'
data = pd.read_csv(file_path)

# Specify the columns of interest including 'MENTALITY'
selected_columns_with_mentality = ['CITY', 'EU', 'V4', 'NATO', 'INT_POL', 'COALITION', 'LEFTRIGHT', 'MONEY', 'CLASS', 'END_MONTH', 'MENTALITY']

# Filter the data to include only the selected columns
data_selected_with_mentality = data[selected_columns_with_mentality].dropna()

# One-hot encode the selected categorical variables
encoder = OneHotEncoder()
encoded_selected_data_with_mentality = encoder.fit_transform(data_selected_with_mentality)

# Transpose the encoded data to treat categories as observations
binary_matrix_selected_with_mentality = encoded_selected_data_with_mentality.T

# Compute the distance matrix for the binary matrix
distance_matrix_selected_with_mentality = pdist(binary_matrix_selected_with_mentality.toarray(), metric='hamming')

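# Define linkage methods
# Note: SciPy documents 'centroid' as correctly defined only for Euclidean distances,
# so its dendrogram from Hamming distances should be read with caution
linkage_methods = ['single', 'complete', 'weighted', 'centroid']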

# Generate and plot dendrograms for each linkage method
for method in linkage_methods:
    # Perform hierarchical clustering using the current linkage method
    linked_method = linkage(distance_matrix_selected_with_mentality, method=method)
    
    # Plot the dendrogram with the current linkage method
    plt.figure(figsize=(12, 8))
    dendrogram(linked_method, labels=encoder.get_feature_names_out(selected_columns_with_mentality), orientation='top', leaf_rotation=90)
    plt.title(f'Hierarchical Clustering Dendrogram ({method.capitalize()} Linkage)')
    plt.xlabel('Categories')
    plt.ylabel('Distance')
    plt.show()

And it works! But I cannot figure out how to transpose the data and compute the distance matrix in Stata. There should be a method, but I cannot figure out how!

Checking the documentation, I basically need to reproduce Example 3 (https://www.stata.com/manuals/mvclusterlinkage.pdf). I want to turn each of my categorical variables into a set of dummies, and see how they are "correlated" with a dendrogram.
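To make the transpose-and-distance step concrete without pandas or SciPy, here is a minimal sketch in plain Python. It assumes complete cases only, as `dropna()` does in the code above, and uses the eight observations among the first eleven in #1 that have both EU and NATO non-missing, together with their CITY values. The names `data`, `indicators` and `hamming` are illustrative, and the Hamming distance is taken as the share of respondents on which two 0/1 columns disagree.

```python
from itertools import combinations

# Complete cases on EU and NATO among the first eleven observations in #1
data = {
    "EU":   [2, 1, 2, 2, 2, 2, 2, 2],
    "NATO": [1, 1, 2, 2, 2, 2, 2, 2],
    "CITY": [1, 1, 1, 2, 1, 1, 2, 1],
}

# One 0/1 column per category: the "transposed" view, in which each
# category (not each respondent) is an item to be clustered
indicators = {
    f"{var}_{cat}": [int(v == cat) for v in values]
    for var, values in data.items()
    for cat in sorted(set(values))
}

def hamming(a, b):
    """Share of respondents on which two 0/1 columns disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

# Pairwise distances between categories, the input to hierarchical clustering
distances = {
    (a, b): hamming(indicators[a], indicators[b])
    for a, b in combinations(indicators, 2)
}

# List pairs from most to least similar
for (a, b), d in sorted(distances.items(), key=lambda item: item[1]):
    print(f"{a:8s} {b:8s} {d:.3f}")
```

On these eight rows the closest pairs are EU_1 with NATO_1 and EU_2 with NATO_2 (distance 0.125 each), so they would merge first under any of the linkage methods. The Hamming distance is one minus the simple matching coefficient; whether Stata's `matching` measure combined with `dissim(oneminus)` reproduces it exactly is worth checking against [MV] matrix dissimilarity before relying on it.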

Anyway, I am sorry for expressing myself in a confusing way. But if anyone sees what I mean, I would appreciate some help!

Last edited by Martinо Cоmelli; 17 Jun 2024, 04:17.
