  • Clustering variables instead of observations

    I would like to cluster categorical variables instead of observations, and then build a dendrogram from them. The documentation for "cluster" says:

    Clustering variables instead of observations
    Sometimes you want to cluster variables rather than observations, so you can use the cluster
    command. One approach to clustering variables in Stata is to use xpose (see [D] xpose) to transpose
    the variables and observations and then to use cluster. Another approach is to use the matrix
    dissimilarity command with the variables option (see [MV] matrix dissimilarity) to produce
    a dissimilarity matrix for the variables. This matrix is then passed to clustermat to obtain the
    hierarchical clustering. See [MV] clustermat.
    And then on the documentation of cluster linkage there is an example that looks like my case, Example 3.
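The two routes in the quoted passage come down to the same computation: the variables, not the observations, are the units being compared. A minimal sketch in plain Python (not Stata; the toy data matrix, the variable names `a`, `b`, `c` and the `mismatch` helper are made up for illustration) of why transposing and then comparing rows gives the same dissimilarities as comparing columns directly:

```python
from itertools import combinations

# Hypothetical toy data: 4 observations (rows) x 3 binary variables (columns).
data = [
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
    [1, 0, 1],
]
names = ["a", "b", "c"]  # illustrative variable names

def mismatch(u, v):
    """Proportion of positions where two equal-length sequences differ."""
    return sum(x != y for x, y in zip(u, v)) / len(u)

# Route 1: transpose so that variables become rows, then compare rows.
transposed = [list(col) for col in zip(*data)]
by_transpose = {
    (names[i], names[j]): mismatch(transposed[i], transposed[j])
    for i, j in combinations(range(len(names)), 2)
}

# Route 2: compute variable-by-variable dissimilarities directly from the columns.
def column(j):
    return [row[j] for row in data]

by_columns = {
    (names[i], names[j]): mismatch(column(i), column(j))
    for i, j in combinations(range(len(names)), 2)
}

print(by_transpose)
```

Either way there are three items to cluster here, one per variable, regardless of how many observations the data contain.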


    [Image: 2024-06-12_21-28-34.png]


    Now I need to do the same with my dataset.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte(S2_KAT S4 Q27A MENTALITY CITY EU NATO COALITION LEFTRIGHT MONEY CLASS END_MONTH)
    6 4 2 0 1 2 1 2 1 3 1 .
    3 2 1 1 1 1 1 1 1 2 2 1
    6 2 1 1 1 2 2 1 2 3 1 .
    3 4 . 1 2 2 2 1 . 4 1 2
    3 4 2 0 1 2 . 2 . 4 1 .
    6 3 2 0 2 . . . 2 1 2 .
    5 3 . 0 1 2 2 2 2 2 2 1
    5 4 2 0 2 . . 2 2 4 1 2
    5 2 1 . 1 2 2 . . 3 1 1
    4 3 1 1 2 2 2 1 2 3 2 .
    3 3 1 1 1 2 2 1 1 4 2 2
    4 4 1 1 2 2 2 1 1 4 1 .
    6 4 1 1 2 2 2 1 2 3 1 .
    4 2 2 . 1 2 2 2 2 2 2 1
    5 2 . 1 1 . . . . 2 1 1
    4 3 . 0 1 1 1 . 1 2 1 1
    5 2 2 . 1 1 1 . 2 1 2 .
    6 4 2 0 1 . . 1 2 1 1 1
    3 3 1 1 3 . 2 1 1 2 2 2
    6 2 2 . 1 2 2 1 2 1 1 .
    6 4 2 . 2 2 1 2 1 3 1 2
    6 2 2 0 2 1 1 2 1 1 1 1
    6 2 2 . 3 2 . 1 . 2 . 2
    3 4 2 . 1 2 2 . 1 3 1 .
    5 2 2 . 1 2 . . . . 2 .
    2 4 2 . 1 2 2 1 2 3 1 .
    3 4 1 1 2 2 2 1 1 4 1 2
    2 3 2 . 1 1 1 1 1 1 1 1
    3 4 1 1 1 2 2 1 2 4 1 .
    5 2 . 1 1 1 1 . 2 1 2 1
    6 3 2 0 2 2 . 2 1 3 2 1
    2 2 2 0 2 2 2 2 1 3 2 .
    6 2 2 0 3 1 1 1 . 2 2 2
    3 4 1 1 2 2 2 1 . 4 1 2
    3 3 2 1 1 1 1 1 . 2 2 .
    2 4 . 0 2 2 2 . . 3 1 .
    2 3 2 1 1 2 2 1 . 4 1 .
    5 2 1 1 1 2 2 1 . 4 2 .
    5 4 1 1 2 2 2 1 . 3 1 .
    6 2 1 1 1 2 2 1 2 2 1 2
    2 3 2 . 3 . . 2 1 2 2 .
    2 4 1 1 1 . . 1 2 2 1 1
    5 3 1 1 2 2 2 1 2 1 2 2
    5 3 1 1 1 2 2 1 2 3 1 .
    3 4 1 1 2 2 2 1 2 4 1 2
    3 3 1 1 1 2 2 . 1 3 2 1
    2 4 1 0 3 2 2 . 1 4 1 .
    6 3 2 0 1 . 2 1 1 1 1 .
    2 3 1 . 1 . . 1 . 1 1 .
    5 3 1 1 3 2 2 1 2 1 1 1
    6 2 1 . 2 2 2 1 . 2 2 .
    5 2 2 . 2 2 2 . . 3 2 .
    5 2 2 1 1 2 1 2 1 1 1 1
    4 3 1 1 3 2 2 1 . 1 1 1
    3 3 2 1 2 2 2 1 1 1 1 1
    6 3 2 0 2 2 2 2 1 1 2 .
    3 4 2 1 2 2 2 1 . 3 1 .
    6 2 2 0 2 . 1 2 1 1 2 1
    6 3 2 1 2 2 1 2 1 3 1 1
    5 3 2 0 1 2 2 2 2 2 1 1
    4 3 1 1 3 2 2 1 2 2 . .
    6 3 1 . 3 2 2 1 2 1 1 1
    3 2 2 1 2 2 2 2 . 2 1 1
    6 3 1 . 1 2 2 1 2 1 2 .
    3 4 1 1 3 2 2 1 2 4 1 2
    2 4 . 1 2 2 2 . 1 1 1 1
    5 2 . 1 1 2 1 . . 1 2 1
    3 2 1 . 1 2 2 1 . 4 1 2
    2 3 1 1 1 . . 1 2 3 1 1
    5 3 2 0 1 2 2 2 1 1 1 1
    6 3 1 . 1 2 2 1 2 1 2 .
    3 4 . 0 3 . . 1 . 1 1 1
    2 3 1 1 2 2 2 1 . 4 1 2
    2 2 1 . 1 2 2 1 2 1 2 .
    6 3 2 0 2 . . 2 1 3 1 .
    4 4 1 . 3 2 2 1 2 2 1 .
    6 4 1 1 3 2 2 1 . 1 1 2
    3 4 1 1 1 2 2 1 . 3 1 .
    2 4 . . 1 2 . . . 1 1 .
    3 2 . . 2 1 1 . 1 1 . 2
    6 3 1 0 2 2 2 . . 3 1 1
    5 4 2 . 2 . . . . 2 1 .
    6 4 1 . 1 2 2 1 2 2 1 .
    3 4 2 . 1 . 1 1 . 4 1 2
    5 3 1 . 3 1 1 1 2 1 2 1
    5 4 1 1 2 2 1 2 . 1 2 .
    6 2 2 . 2 . . 2 1 3 1 1
    5 3 2 0 2 2 2 2 2 4 1 2
    6 3 1 1 3 2 2 1 1 3 1 .
    3 4 2 1 2 2 2 1 2 2 2 1
    3 4 1 1 1 2 2 1 . 2 1 1
    3 2 1 . 3 1 1 2 1 1 2 1
    6 3 . . 2 2 2 . 2 . 2 .
    6 4 1 1 3 2 2 1 2 3 1 2
    6 3 2 1 2 1 1 2 2 1 2 1
    5 4 1 1 1 2 2 1 2 4 1 2
    6 3 2 1 3 2 2 2 1 1 2 1
    2 2 2 . 3 1 1 1 1 1 2 1
    2 3 . 1 2 . . 1 . 1 2 1
    3 4 1 1 2 2 2 1 2 4 2 .
    end
    label values S2_KAT labels1
    label def labels1 2 "18-29", modify
    label def labels1 3 "30-39", modify
    label def labels1 4 "40-49", modify
    label def labels1 5 "50-59", modify
    label def labels1 6 "60 a viac", modify
    label values S4 labels3
    label def labels3 2 "bez_maturity", modify
    label def labels3 3 "maturita", modify
    label def labels3 4 "vysokoskolske", modify
    label values Q27A labels49
    label def labels49 1 "Ivan Korčok", modify
    label def labels49 2 "Peter Pellegrini", modify
    label values MENTALITY MENTALITY
    label def MENTALITY 0 "SK not backward", modify
    label def MENTALITY 1 "SK backward", modify
    label values CITY CITY
    label def CITY 1 "Village (up to 4999)", modify
    label def CITY 2 "City (from 5000)", modify
    label def CITY 3 "more than 100000", modify
    label values EU EU
    label def EU 1 "NO EU", modify
    label def EU 2 "PRO EU", modify
    label values NATO NATO
    label def NATO 1 "NO NATO", modify
    label def NATO 2 "PRO NATO", modify
    label values COALITION COALITION
    label def COALITION 1 "Opposition", modify
    label def COALITION 2 "In Power", modify
    label values LEFTRIGHT LEFTRIGHT
    label def LEFTRIGHT 1 "Left", modify
    label def LEFTRIGHT 2 "right", modify
    label values MONEY MONEY
    label def MONEY 1 "up to 1050 eur", modify
    label def MONEY 2 "1 051 € to 1 400 €", modify
    label def MONEY 3 "1 401 € to 2 500 €", modify
    label def MONEY 4 "more than 2501 eur", modify
    label values CLASS CLASS
    label def CLASS 1 "White collars", modify
    label def CLASS 2 "Workers", modify
    label values END_MONTH END_MONTH
    label def END_MONTH 1 "Difficult", modify
    label def END_MONTH 2 "Easy", modify
    I have tried to follow the example above, but I get stuck every time. Who can suggest the correct syntax?

    Thank you for your attention!

  • #2
    UPDATE: I am not sure that the strategy in Example 3 is the best one for me. I would like to perform clustering with categorical variables, totaling 31 options (as in the sample dataset posted above). However, the dendrogram I get by following the example in Stata has more than 200 leaves, which is a mistake. Can somebody help?


    • #3
      [Image: ok....png]


      This is what I need to do: a dendrogram with each variable's values. How can I do it with Stata?


      • #4
        You need to be clear first that you have a way to judge similarity or dissimilarity between variables on quite different scales. On the face of it some are nominal and some are ordinal.
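One way to see the concern: with nominal variables the numeric codes are arbitrary, so a dissimilarity computed on the raw codes can change when the codes are merely relabelled. A minimal sketch in plain Python (toy values and a hypothetical `euclidean` helper; not taken from the thread's data):

```python
import math

# Hypothetical toy columns: two nominal variables coded 1/2 for five respondents.
eu = [1, 2, 2, 1, 2]
nato = [1, 2, 2, 1, 1]

def euclidean(u, v):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

# Swap the (arbitrary) codes of the second variable: 1 <-> 2.
nato_recoded = [3 - x for x in nato]

d_original = euclidean(eu, nato)
d_recoded = euclidean(eu, nato_recoded)

print(d_original, d_recoded)
```

The information in the second variable is unchanged by the relabelling, yet the distance between the two columns is different, which is why a measure suited to the scale of the variables has to be settled first.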


        • #5
          Originally posted by Nick Cox
          You need to be clear first that you have a way to judge similarity or dissimilarity between variables on quite different scales. On the face of it some are nominal and some are ordinal.
          Thank you for taking a look! I could just use nominal variables to make it easier. I think there is just one ordinal variable, which I can exclude; the rest are either dummies or categorical/nominal.
          Do you have an idea how I can compute this? Following the example above, I am not getting very far.


          • #6
            For sure I can proceed by splitting each variable into, let's say, MENTALITY_0 and MENTALITY_1, etc., but then what?


            • #7
              Sorry, but I don't have a clear idea of what you want to do — or a better idea of what to do. In principle, we should be interested in clusterings that expose cluster structure. In practice, with your kind of data, they are rarely helpful.
              Last edited by Nick Cox; 16 Jun 2024, 03:40.


              • #8
                Thank you for checking again!

                The dendrogram above was produced with my data, and it makes sense (to me at least, ahah). My goal here is to replicate it in Stata, and I am not sure how.

                The idea is to see how the answers to each categorical variable correlate with each other. In the example above, all the categorical variables have been split into dummies (like END_MONTH_easy or END_MONTH_difficult), and a dendrogram has been created: we see that END_MONTH_difficult is close to people who live in villages, etc.

                I would like to reproduce this in Stata. I tried to follow the documentation, but I am not sure what I am doing wrong...
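The splitting step described in this post can be sketched in plain Python (a sketch only, not a Stata solution; the three toy rows, the `one_hot` helper and the `VAR_value` naming are assumptions for illustration). Each category of each variable becomes its own 0/1 column, and those columns, not the respondents, are what the dendrogram would show as leaves:

```python
# Hypothetical toy rows using two variable names from the thread's data example.
rows = [
    {"END_MONTH": "Difficult", "CITY": "Village"},
    {"END_MONTH": "Easy", "CITY": "City"},
    {"END_MONTH": "Difficult", "CITY": "Village"},
]

def one_hot(rows):
    """Return {dummy_name: [0/1 per row]} with one dummy per observed category."""
    dummies = {}
    for var in rows[0]:
        for value in sorted({r[var] for r in rows}):
            dummies[f"{var}_{value}"] = [int(r[var] == value) for r in rows]
    return dummies

dummies = one_hot(rows)
for name, column in dummies.items():
    print(name, column)
```

In this toy input the columns for END_MONTH_Difficult and CITY_Village are identical, so any dissimilarity based on disagreement between them would be zero and they would join first in a dendrogram.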


                • #9
                  This thread is still puzzling to me because the question keeps changing, or at least I don't follow the twists and turns of different posts.

                  Perhaps as extreme cases

                  1. You've got results in Stata but you don't believe them. #2

                  2. You've got results outside Stata and you want to replicate them in Stata. #8

                  I don't have any suggestion under either or both scenarios. You don't give any code that you used, whether inside or outside Stata.

                  The label definitions alone in #1 suggest 5 x 3 x 2 x 2 x 3 x 2 x 2 x 2 x 2 x 4 x 2 x 2 possible cross-combinations. I make that 46080,
                  and even if many don't occur, a dendrogram with hundreds of leaves doesn't surprise me. (If I missed some or double counted, sorry, but I think the point is unaffected.)

                  This sort of problem is one of several reasons why clustering is often disappointing for categorical data, even if the variables are measured consistently. For example, even 10 binary variables means 2^10 = 1024 cross-combinations. If you're lucky a dendrogram has simple form, but that is not guaranteed.
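The arithmetic in this post is quick to check. A short Python sketch (the `levels` list is transcribed from the label definitions in #1; the variable names in the code are illustrative):

```python
import math

# Number of labelled categories per variable, read off the label definitions in #1:
# S2_KAT, S4, Q27A, MENTALITY, CITY, EU, NATO, COALITION, LEFTRIGHT, MONEY, CLASS, END_MONTH
levels = [5, 3, 2, 2, 3, 2, 2, 2, 2, 4, 2, 2]

# Distinct answer patterns an observation could in principle show.
possible_profiles = math.prod(levels)

# Number of separate categories across all variables.
total_categories = sum(levels)

# The ten-binary-variables example.
ten_binaries = 2 ** 10

print(possible_profiles, total_categories, ten_binaries)
```

The product is the count of possible cross-combinations when observations are clustered, while the sum, 31, matches the "31 options" mentioned in #2.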


                  • #10
                    Ahaha, I am sorry! The question, at least in my mind, is the same. My goal is literally to reproduce in Stata the dendrogram I showed above!

                    This is the Python code that generated the image:

                    Code:
                    import pandas as pd
                    from sklearn.preprocessing import OneHotEncoder
                    from scipy.cluster.hierarchy import linkage, dendrogram
                    import matplotlib.pyplot as plt
                    from scipy.spatial.distance import pdist
                    
                    # Load the dataset
                    file_path = '/Users/kome/Downloads/Alchimia/Prova_ChatGPT3.csv'
                    data = pd.read_csv(file_path)
                    
                    # Specify the columns of interest including 'MENTALITY'
                    selected_columns_with_mentality = ['CITY', 'EU', 'V4', 'NATO', 'INT_POL', 'COALITION', 'LEFTRIGHT', 'MONEY', 'CLASS', 'END_MONTH', 'MENTALITY']
                    
                    # Filter the data to include only the selected columns
                    data_selected_with_mentality = data[selected_columns_with_mentality].dropna()
                    
                    # One-hot encode the selected categorical variables
                    encoder = OneHotEncoder()
                    encoded_selected_data_with_mentality = encoder.fit_transform(data_selected_with_mentality)
                    
                    # Transpose the encoded data to treat categories as observations
                    binary_matrix_selected_with_mentality = encoded_selected_data_with_mentality.T
                    
                    # Compute the distance matrix for the binary matrix
                    distance_matrix_selected_with_mentality = pdist(binary_matrix_selected_with_mentality.toarray(), metric='hamming')
                    
                    # Define linkage methods
                    linkage_methods = ['single', 'complete', 'weighted', 'centroid']
                    
                    # Generate and plot dendrograms for each linkage method
                    for method in linkage_methods:
                        # Perform hierarchical clustering using the current linkage method
                        linked_method = linkage(distance_matrix_selected_with_mentality, method=method)
                        
                        # Plot the dendrogram with the current linkage method
                        plt.figure(figsize=(12, 8))
                        dendrogram(linked_method, labels=encoder.get_feature_names_out(selected_columns_with_mentality), orientation='top', leaf_rotation=90)
                        plt.title(f'Hierarchical Clustering Dendrogram ({method.capitalize()} Linkage)')
                        plt.xlabel('Categories')
                        plt.ylabel('Distance')
                        plt.show()
                    And it works!!! But I cannot figure out how to transpose the data and compute the distance matrix in Stata. There should be a method, but I cannot figure out how!

                    Checking the documentation, I basically need to reproduce Example 3: https://www.stata.com/manuals/mvclusterlinkage.pdf That is, I want to turn every one of my categorical variables into dummies and see how they are "correlated" using a dendrogram.

                    Anyway! I am sorry for expressing myself in a confused way. But if anyone sees what I mean, I would need some help!
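The logic of the steps in question (a distance between dummy columns, then agglomeration) can be spelled out without any libraries. The sketch below is plain Python rather than Stata, uses made-up dummy columns with hypothetical names, and implements a naive single-linkage loop; it is meant to show the logic only, not to reproduce scipy's or Stata's output:

```python
from itertools import combinations

# Hypothetical dummy columns, one entry per category (i.e. already "transposed").
dummies = {
    "EU_pro": [1, 1, 0, 1, 0, 1],
    "NATO_pro": [1, 1, 0, 1, 0, 0],
    "CITY_big": [0, 0, 1, 0, 1, 1],
    "CLASS_wrk": [0, 1, 1, 1, 1, 1],
}

def hamming(u, v):
    """Proportion of observations on which two 0/1 columns disagree."""
    return sum(x != y for x, y in zip(u, v)) / len(u)

# Pairwise distance between categories, keyed by the unordered pair of names.
dist = {
    frozenset(pair): hamming(dummies[pair[0]], dummies[pair[1]])
    for pair in combinations(dummies, 2)
}

def gap(a, b):
    """Single-linkage distance: the closest pair of members across two clusters."""
    return min(dist[frozenset([x, y])] for x in a for y in b)

def single_linkage(names):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [frozenset([n]) for n in names]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda ab: gap(*ab))
        merges.append((sorted(a), sorted(b), gap(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges

merges = single_linkage(list(dummies))
for left, right, height in merges:
    print(left, "+", right, "at", round(height, 3))
```

With four categories there are exactly three merges, so the number of leaves is the number of dummy columns, whatever the number of respondents.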


                    Last edited by Martinо Cоmelli; 17 Jun 2024, 04:17.
