
  • Problem with PCA

    Hi all! My research is on the impact of immunization coverage on child mortality in India. There is data available on coverage of different vaccines such as BCG, DPT, measles, etc. (all in %). I have done PCA and then constructed an index using PC1. I have run a simple linear regression of my y variable, i.e. U5MR, on the index and other explanatory variables such as sanitation, dietary intake, mother's education, etc. I am not able to figure out how to interpret the regression coefficient on the constructed index.
    Last edited by Prerna Pandey; 10 Apr 2020, 02:36.

  • #2
    PC1 is a weighted composite of all the variables fed to the PCA. At best it is an overall summary of the main pattern in those data. At worst it is a mishmash. If you want to understand more about PC1 you may need to

    * plot PC1 against other variables, especially those loading highly on it (such plots may reveal problems with outliers, skewness, groupings, nonlinearity, etc. that you would need to worry about)

    * look at correlations ditto

    * consider whether just one of the variables loading highly may work just about as well -- and be easier to talk about.

    For a technique easy to describe -- calculate the eigenvalues and eigenvectors of a correlation or covariance matrix -- PCA is harder to understand without a lot of practice. People sometimes have unjustifiable faith that you feed dirty data in and get clean patterns out. But it's not a washing machine: the dirt just gets redistributed.
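    The mechanics are easy to state in code as well. Here is a minimal sketch in Python with NumPy rather than Stata (synthetic data; all names are illustrative): standardize the variables, eigen-decompose their correlation matrix, and the eigenvector with the largest eigenvalue gives PC1.

```python
import numpy as np

rng = np.random.default_rng(0)
# Three correlated toy variables (hypothetical stand-ins for coverage rates)
common = rng.normal(size=(200, 1))
data = np.hstack([common + 0.3 * rng.normal(size=(200, 1)) for _ in range(3)])

# PCA on the correlation matrix: standardize, then eigen-decompose
z = (data - data.mean(axis=0)) / data.std(axis=0)
corr = np.corrcoef(z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)      # eigh returns ascending order
order = np.argsort(eigvals)[::-1]            # sort descending, as pca does
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pc1 = z @ eigvecs[:, 0]                      # scores on the first component
print(eigvals / eigvals.sum())               # proportion of variance explained
```

    The eigenvalues sum to the number of variables (the trace of the correlation matrix), and the variance of the PC1 scores equals the first eigenvalue, matching the header of Stata's pca output.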



    • #3
      Okay. I'll work on it. Thanks



      • #4
        Here is a silly example but the technique is more general. I use only variables connected with car size from the auto data. So, this is shooting fish in a barrel: PC1 does quite well and can be interpreted as an overall measure of size. The eigenvectors are informative but I prefer to scale them to correlations. I (and I suspect most of my readers) have a much better feel for correlations than for the elements of eigenvectors. I know how to think about correlations in terms of -1 0 1 and to relate any correlation to the corresponding scatter plot.
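        That loading-to-correlation scaling has a simple closed form: for PCA on a correlation matrix, the correlation between variable j and component k equals the eigenvector element times the square root of the eigenvalue. A small check in Python/NumPy (not Stata; synthetic data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
common = rng.normal(size=(300, 1))
data = np.hstack([common + 0.5 * rng.normal(size=(300, 1)) for _ in range(4)])

# PCA on the correlation matrix
z = (data - data.mean(axis=0)) / data.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(z, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
pc1 = z @ eigvecs[:, 0]

# loading scaled to a correlation: v_j * sqrt(lambda_1)
scaled = eigvecs[:, 0] * np.sqrt(eigvals[0])
# direct correlation of each variable with the PC1 scores
direct = np.array([np.corrcoef(z[:, j], pc1)[0, 1] for j in range(4)])
print(np.allclose(scaled, direct))           # the two routes agree
```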

        To look at correlations and scatter plots for PC1 and the original variables, I could use correlate and graph matrix and ignore results not of direct interest. But here I use crossplot and cpcorr from SSC for more focused output.

        For more on those two extra commands, see

        https://www.statalist.org/forums/for...ailable-on-ssc

        https://www.statalist.org/forums/for...f-correlations



        Code:
        . sysuse auto, clear
        (1978 Automobile Data)
        
        . pca headroom trunk length displacement
        
        Principal components/correlation                 Number of obs    =         74
                                                         Number of comp.  =          4
                                                         Trace            =          4
            Rotation: (unrotated = principal)            Rho              =     1.0000
        
            --------------------------------------------------------------------------
               Component |   Eigenvalue   Difference         Proportion   Cumulative
            -------------+------------------------------------------------------------
                   Comp1 |      2.92212      2.28986             0.7305       0.7305
                   Comp2 |      .632263       .32651             0.1581       0.8886
                   Comp3 |      .305754      .165892             0.0764       0.9650
                   Comp4 |      .139861            .             0.0350       1.0000
            --------------------------------------------------------------------------
        
        Principal components (eigenvectors)
        
            --------------------------------------------------------------------
                Variable |    Comp1     Comp2     Comp3     Comp4 | Unexplained
            -------------+----------------------------------------+-------------
                headroom |   0.4446    0.7399    0.4944   -0.1022 |           0
                   trunk |   0.5142    0.2423   -0.7596    0.3160 |           0
                  length |   0.5328   -0.3755   -0.0732   -0.7548 |           0
            displacement |   0.5041   -0.5028    0.4161    0.5656 |           0
            --------------------------------------------------------------------
        
        . predict PC1
        (score assumed)
        (3 components skipped)
        
        Scoring coefficients
            sum of squares(column-loading) = 1
        
            ------------------------------------------------------
                Variable |    Comp1     Comp2     Comp3     Comp4
            -------------+----------------------------------------
                headroom |   0.4446    0.7399    0.4944   -0.1022
                   trunk |   0.5142    0.2423   -0.7596    0.3160
                  length |   0.5328   -0.3755   -0.0732   -0.7548
            displacement |   0.5041   -0.5028    0.4161    0.5656
            ------------------------------------------------------
        
        . ssc install crossplot
        
        
        . crossplot (PC1) headroom trunk length displacement
        
        . ssc install cpcorr
        
        
        . cpcorr  headroom trunk length displacement \ PC1
        (obs=74)
        
                         PC1
            headroom  0.7601
               trunk  0.8789
              length  0.9108
        displacement  0.8617

        Here's the graph:
        [Graph: crossplot.png, PC1 plotted against headroom, trunk, length, and displacement]




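        The correlations that cpcorr reports can be recovered by hand from the pca output above: each is the PC1 eigenvector element times the square root of the PC1 eigenvalue. A quick arithmetic check in Python (numbers copied from the listing; agreement is to rounding):

```python
import math

eigval_pc1 = 2.92212
loadings = {"headroom": 0.4446, "trunk": 0.5142,
            "length": 0.5328, "displacement": 0.5041}
cpcorr_out = {"headroom": 0.7601, "trunk": 0.8789,
              "length": 0.9108, "displacement": 0.8617}

for var, load in loadings.items():
    r = load * math.sqrt(eigval_pc1)
    print(f"{var:>12s}  {r:.4f}  (cpcorr: {cpcorr_out[var]:.4f})")
```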
        To me this says: Don't use PC1, use length. It conveys almost the same information and is easier to think and write about.

        More generally, if you have a group of closely related variables, consider just choosing one as a predictor. If anyone wants to suggest that this advice does not hinge on using or thinking about PCA, I agree.

        The question otherwise is: If you just look at the relationships between the original variables using correlate and graph matrix, is it quite so obvious which variable is nearest the main pattern?

        In this example, a more subtle analysis would think about more dimensions, as two variables are dimensionally length and two dimensionally volume, but that nuance is orthogonal to my main purpose now.
