  • principal component analysis for cluster analysis

    Hello world,
    I have a dataset of 500 records. Each record contains 20 scale variables and 10 dichotomous variables.
    I want to divide these 500 records into 3 groups and plot them on a 2-dimensional X-Y plane.
    I chose 2 dimensions because this is more intuitive than 3 or more dimensions.

    Q1. Would it make sense to generate X and Y coordinates by using PCA, and then use these coordinates in the subsequent cluster analysis?

    Q2. If so, would it be better to normalise the scale variables, say, by using a Box-Cox transformation?

    Q3. Can I input the dichotomous variables into PCA as they are?

    Cheers

    Yoshi Nagao

  • #2
    I have a dataset of 500 records. Each record contains 20 scale variables and 10 dichotomous variables.
    I want to divide these 500 records into 3 groups and plot them on a 2-dimensional X-Y plane.
    I chose 2 dimensions because this is more intuitive than 3 or more dimensions.
    Why 3 groups? What makes you confident that it makes sense to identify groups, rather than expecting a continuum of variation, and if so why 3 groups, and not 2, or 7 or 33?

    Q1. Would it make sense to generate X and Y coordinates by using PCA, and then use these coordinates in the subsequent cluster analysis?
    Impossible to say without seeing some results. You are guaranteed that 2 PCs from 30 variables account for at least 2/30 or about 7% of the total variability. You might do much better than that, or you might not.
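    A minimal sketch of that check in Stata, with s1-s20 and d1-d10 as placeholder names for your 20 scale and 10 dichotomous variables:

    Code:
    pca s1-s20 d1-d10                        // correlation-matrix PCA by default
    display "first 2 PCs: " ///
        %4.1f 100*(e(Ev)[1,1] + e(Ev)[1,2])/e(trace) " percent of total variance"
    predict pc1 pc2, score                   // the X and Y coordinates you want
    cluster kmeans pc1 pc2, k(3) name(grp3)  // your 3-group solution
    scatter pc2 pc1                          // eyeball whether groups look plausible

    If that displayed percentage is small, your 2-D picture is throwing away most of the variability.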

    Q2. If so, would it be better to normalise the scale variables, say, by using a Box-Cox transformation?
    Why do you think that might help? Detail: Box-Cox please. We Coxes stick up for each other.

    Q3. Can I input the dichotomous variables into PCA as they are?
    I don't think there is any alternative given that you input them at all. Any one-to-one transformation will at most change the sign of any correlations.



    • #3
      Originally posted by Nick Cox
      Why 3 groups? What makes you confident that it makes sense to identify groups, rather than expecting a continuum of variation, and if so why 3 groups, and not 2, or 7 or 33?
      Impossible to say without seeing some results. You are guaranteed that 2 PCs from 30 variables account for at least 2/30 or about 7% of the total variability. You might do much better than that, or you might not.
      Why do you think that might help? Detail: Box-Cox please. We Coxes stick up for each other.
      I don't think there is any alternative given that you input them at all. Any one-to-one transformation will at most change the sign of any correlations.
      ---------------------------------------------
      Dear Nick

      Thank you very much for your swift response.
      I should have explained my question in more detail.
      My interest was sparked by Lancet Child Adolesc Health. 2023;7(10):697-707. doi: 10.1016/S2352-4642(23)00166-9. In particular, the following part in the methods section: “we calculated Hopkin’s statistic with the R package factoextra version 1.0.7 to quantify the clustering tendency of phenotypic data. A value close to one indicates that the data are highly clustered, random data will tend to result in values around 0·5, and uniformly distributed data will result in values close to zero. We used hierarchical clustering on principal components for cluster analysis, which involves normalisation of input data, principal component analysis, hierarchical clustering, parcellation of hierarchical tree by the optimal number of clusters, and result consolidation by k-means clustering (an algorithm that partitions the samples into k clusters based on smallest mean distances to cluster centres). Principal component analysis was done on the 14 selected continuous variables. All categorical variables …were used as supplementary variables that neither participated in dimensionality reduction nor affected the cluster results, but only assisted in interpretation of the clinical characteristics of the identified clusters. The optimal number of clusters was determined by comparing the inertia, which measures the sum of squared distance of samples to their closest cluster centre and quantifies how well a dataset is clustered. The cluster analysis was computed by the FactoMineR package version 2.7 for R.”

      In this section, the authors used the Hopkins statistic to quantify the clustering tendency and “inertia” to determine the optimal number of clusters. I do not know how to do these in Stata, so instead I used my instinct: when we attempt to divide something that has been assumed to be a single entity, we had better be conservative, and starting from 3 groups therefore seemed better than starting from a bigger number. Similarly, I want to shrink the dimensions to 2 principal components, because we are accustomed to X-Y graphs, but not to X-Y-Z graphs.

      I have many questions about this method section:
      First, they used “hierarchical clustering”. However, k-means clustering is a non-hierarchical method, isn’t it?

      Second, “all categorical variables … (did not) participate in dimensionality reduction … only assisted in interpretation”. Why can categorical variables (including dichotomous ones) not participate in the principal component analysis and the subsequent cluster analysis? Even sex, which is an extremely important biological variable, is excluded.
      Third, the authors “normalised input data”. It seems that the Stata reference manual does not give a definitive answer on when to normalise and when not to.
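      (As far as I can tell, Stata's pca analyses the correlation matrix by default, which amounts to standardising each variable; the covariance option turns that off. Here z1-z30 are placeholder names for my 30 variables:

      Code:
      pca z1-z30               // default: correlation matrix, variables standardised
      pca z1-z30, covariance   // covariance matrix: variables keep their own scales
      )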

      As for the method of Box and Cox, I often use the boxcox command to obtain the transformation parameter that achieves a symmetric distribution.
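      Concretely, for a single variable the bcskew0 command finds the Box-Cox lambda that zeroes the skewness; a minimal sketch with s1 as a placeholder name (the variable must be positive):

      Code:
      bcskew0 bs1 = s1                             // creates the transformed variable bs1
      display "zero-skewness lambda = " r(lambda)  // the estimated transform parameter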

      Your further assistance would be appreciated.

      Yoshiro



      • #4
        In effect you're saying that you will start with 3 clusters and explore. That makes sense.

        k-means is indeed not hierarchical.

        Many researchers are queasy about mixing -- in your terms -- scale and dichotomous variables. You could argue the point either way.
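        If mixing them worries you, one hedged alternative in Stata is the Gower dissimilarity measure, which was designed for mixed continuous and binary variables; a sketch, again with s1-s20 and d1-d10 as placeholder names:

        Code:
        matrix dissimilarity D = s1-s20 d1-d10, gower   // mixed-type dissimilarities
        clustermat wardslinkage D, name(gw) add         // hierarchical clustering on D
        cluster generate grp3 = groups(3), name(gw)     // cut the tree at 3 groups

        That avoids pretending the 0/1 variables are on the same footing as the scale ones inside a PCA.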

        I don't see Box-Cox as being about the distribution of coefficients, more about the data, which is what I suspect you mean. I can't see that pulling in tails is necessarily helpful for classification. Asymmetric variation is part of what you're classifying.

        There are many biases and prejudices about classification and here are a few of mine. On most measures I expect a continuum of variation, not distinct groups. Classifications may appeal for some purposes but they are often foisted upon the data to suit the researcher's goals.



        • #5
          Originally posted by Nick Cox
          I don't see Box-Cox as being about the distribution of coefficients, more about the data, which is what I suspect you mean. I can't see that pulling in tails is necessarily helpful for classification. Asymmetric variation is part of what you're classifying.

          There are many biases and prejudices about classification and here are a few of mine. On most measures I expect a continuum of variation, not distinct groups. Classifications may appeal for some purposes but they are often foisted upon the data to suit the researcher's goals.
          Normalisation: I will incorporate the independent variables as they are.
          Dichotomous variables: I would use the dichotomous/categorical variables if doing so could make the results look … more attractive!

          I completely agree with you that classification (or clustering) can be used for subjective purposes. For example, even if X and Y are not correlated in the entire dataset (n=100, P=0.5), authors may say "the correlation was statistically significant in subgroup A (n=10, P=0.04)". However, cluster analysis will (hopefully) perfectly distinguish between horses and salamanders, or between pasta and sushi. Whether to adopt the results of such an analysis must depend upon the conscience of the authors.



          • #6
            See my coworker's dissertation chapter for some useful steps, but note that vanilla PCA is quite vulnerable to outliers.



            • #7
              Hi Jared, thank you very much for letting me know about this important work. However, my knowledge of statistics is completely insufficient to understand this precious dissertation. By the way, can anybody implement the Hopkins statistic and "inertia" in Stata? These are indicators of how strongly the clusters are aggregated, as in the passage quoted below; my own rough sketches follow the quotation.

              "To test our hypothesis of clustered clinical features among patients with Kawasaki disease, we calculated Hopkin’s statistic with the R package factoextra version 1.0.7 to quantify the clustering tendency of phenotypic data. A value close to one indicates that the data are highly clustered, random data will tend to result in values around 0·5, and uniformly distributed data will result in values close to zero.[12]

              The optimal number of clusters was determined by comparing the inertia, which measures the sum of squared distance of samples to their closest cluster centre and quantifies how well a dataset is clustered. The cluster analysis was computed by the FactoMineR package version 2.7 for R.[14]

              12 Hopkins B, Skellam JG. A new method for determining the type of distribution of plant individuals. Ann Bot (Lond) 1954; 18: 213–27.
              14 Lê S, Josse J, Husson F. FactoMineR: An R package for multivariate analysis. J Stat Softw 2008; 25: 1–18."
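
              To show what I am after, here is my rough, untested sketch of the Hopkins statistic in Mata. It follows the Hopkins-Skellam idea (uniform pseudo-points drawn inside the data's bounding box) rather than factoextra's exact code, and it assumes the input variables are already standardised:

              Code:
              mata:
              // Hopkins statistic: H = sum(u) / (sum(u) + sum(w))
              // u: nearest-neighbour distances from m uniform pseudo-points to the data
              // w: nearest-neighbour distances from m sampled data points to the rest
              // H near 1 suggests clustering; about 0.5 suggests random data
              real scalar hopkins(real matrix X, real scalar m)
              {
                  real scalar    n, i
                  real rowvector lo, hi
                  real matrix    U
                  real colvector u, w, d, idx

                  n  = rows(X)
                  lo = colmin(X)
                  hi = colmax(X)

                  // m uniform pseudo-points inside the bounding box of the data
                  U = lo :+ runiform(m, cols(X)) :* (hi :- lo)

                  // m real points sampled without replacement
                  idx = jumble(1::n)[|1 \ m|]

                  u = J(m, 1, .)
                  w = J(m, 1, .)
                  for (i = 1; i <= m; i++) {
                      d    = sqrt(rowsum((X :- U[i, .]):^2))       // pseudo-point to data
                      u[i] = min(d)
                      d    = sqrt(rowsum((X :- X[idx[i], .]):^2))  // real point to data
                      d[idx[i]] = maxdouble()                      // exclude the point itself
                      w[i] = min(d)
                  }
                  return(sum(u) / (sum(u) + sum(w)))
              }
              end

              // usage sketch, with z1-z30 as placeholder (standardised) variable names
              putmata X = (z1-z30), replace
              mata: hopkins(X, 50)

              Corrections would be very welcome.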



              I agree with Nick that the data points in the attached article do not seem to cluster into discrete subgroups, and I would like to reproduce the statistical process myself. But I have used R only once, a few decades ago.
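
              For the "inertia", my equally rough sketch loops over candidate numbers of clusters and adds up the within-cluster sums of squares by hand (again with placeholder names z1-z30; this mimics, but is surely not identical to, FactoMineR's computation):

              Code:
              forvalues k = 2/8 {
                  cluster kmeans z1-z30, k(`k') name(km`k') start(krandom(12345))
                  local wss 0
                  foreach v of varlist z1-z30 {
                      quietly anova `v' km`k'        // one-way ANOVA by cluster
                      local wss = `wss' + e(rss)     // within-cluster SS for `v'
                  }
                  display "k = `k'   inertia (within-cluster SS) = " %9.2f `wss'
              }

              Plotting the total against k and looking for an elbow is, I gather, how this comparison is usually read.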

              Yoshiro

