
  • Problem with PCA

    Hi all! My research is on the impact of immunization coverage on child mortality in India. There is data available on coverage of different vaccines such as BCG, DPT, measles, etc. (all in %). I have done PCA and then constructed an index using PC1. I have run a simple linear regression of my y variable, i.e. U5MR, on the index and other explanatory variables such as sanitation, dietary intake, mother's education, etc. I am not able to figure out how to interpret the regression coefficient on the constructed index.
    Last edited by Prerna Pandey; 10 Apr 2020, 02:36.

  • #2
    PC1 is a weighted composite of all the variables fed to the PCA. At best it is an overall summary of the main pattern in those data. At worst it is a mishmash. If you want to understand more about PC1 you may need to

    * plot PC1 against other variables, especially those loading highly on it (such plots may reveal problems with outliers, skewness, groupings, nonlinearity, etc. that you would need to worry about)

    * look at correlations ditto

    * consider whether just one of the variables loading highly may work just about as well -- and be easier to talk about.

    For a technique easy to describe -- calculate the eigenvalues and eigenvectors of a correlation or covariance matrix -- PCA is harder to understand without a lot of practice. People sometimes have unjustifiable faith that you feed dirty data in and get clean patterns out. But it's not a washing machine: the dirt just gets redistributed.
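    The mechanics are easy to state in code as well. Here is a minimal sketch in Python with NumPy rather than Stata (synthetic data; all names are illustrative): standardize the variables, eigen-decompose their correlation matrix, and the eigenvector with the largest eigenvalue gives PC1.

```python
import numpy as np

rng = np.random.default_rng(0)
# Three correlated toy variables (hypothetical stand-ins for coverage rates)
common = rng.normal(size=(200, 1))
data = np.hstack([common + 0.3 * rng.normal(size=(200, 1)) for _ in range(3)])

# PCA on the correlation matrix: standardize, then eigen-decompose
z = (data - data.mean(axis=0)) / data.std(axis=0)
corr = np.corrcoef(z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)      # eigh returns ascending order
order = np.argsort(eigvals)[::-1]            # sort descending, as pca does
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pc1 = z @ eigvecs[:, 0]                      # scores on the first component
print(eigvals / eigvals.sum())               # proportion of variance explained
```

    The eigenvalues sum to the number of variables (the trace of the correlation matrix), and the variance of the PC1 scores equals the first eigenvalue, matching the header of Stata's pca output.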



    • #3
      Okay. I'll work on it. Thanks



      • #4
        Here is a silly example but the technique is more general. I use only variables connected with car size from the auto data. So, this is shooting fish in a barrel: PC1 does quite well and can be interpreted as an overall measure of size. The eigenvectors are informative but I prefer to scale them to correlations. I (and I suspect most of my readers) have a much better feel for correlations than for the elements of eigenvectors. I know how to think about correlations in terms of -1 0 1 and to relate any correlation to the corresponding scatter plot.
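        That loading-to-correlation scaling has a simple closed form: for PCA on a correlation matrix, the correlation between variable j and component k equals the eigenvector element times the square root of the eigenvalue. A small check in Python/NumPy (not Stata; synthetic data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
common = rng.normal(size=(300, 1))
data = np.hstack([common + 0.5 * rng.normal(size=(300, 1)) for _ in range(4)])

# PCA on the correlation matrix
z = (data - data.mean(axis=0)) / data.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(z, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
pc1 = z @ eigvecs[:, 0]

# loading scaled to a correlation: v_j * sqrt(lambda_1)
scaled = eigvecs[:, 0] * np.sqrt(eigvals[0])
# direct correlation of each variable with the PC1 scores
direct = np.array([np.corrcoef(z[:, j], pc1)[0, 1] for j in range(4)])
print(np.allclose(scaled, direct))           # the two routes agree
```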

        To look at correlations and scatter plots for PC1 and the original variables, I could use correlate and graph matrix and ignore results not of direct interest. But here I use crossplot and cpcorr from SSC for more focused output.

        For more on those two extra commands, see

        https://www.statalist.org/forums/for...ailable-on-ssc

        https://www.statalist.org/forums/for...f-correlations



        Code:
        . sysuse auto, clear
        (1978 Automobile Data)
        
        . pca headroom trunk length displacement
        
        Principal components/correlation                 Number of obs    =         74
                                                         Number of comp.  =          4
                                                         Trace            =          4
            Rotation: (unrotated = principal)            Rho              =     1.0000
        
            --------------------------------------------------------------------------
               Component |   Eigenvalue   Difference         Proportion   Cumulative
            -------------+------------------------------------------------------------
                   Comp1 |      2.92212      2.28986             0.7305       0.7305
                   Comp2 |      .632263       .32651             0.1581       0.8886
                   Comp3 |      .305754      .165892             0.0764       0.9650
                   Comp4 |      .139861            .             0.0350       1.0000
            --------------------------------------------------------------------------
        
        Principal components (eigenvectors)
        
            --------------------------------------------------------------------
                Variable |    Comp1     Comp2     Comp3     Comp4 | Unexplained
            -------------+----------------------------------------+-------------
                headroom |   0.4446    0.7399    0.4944   -0.1022 |           0
                   trunk |   0.5142    0.2423   -0.7596    0.3160 |           0
                  length |   0.5328   -0.3755   -0.0732   -0.7548 |           0
            displacement |   0.5041   -0.5028    0.4161    0.5656 |           0
            --------------------------------------------------------------------
        
        . predict PC1
        (score assumed)
        (3 components skipped)
        
        Scoring coefficients
            sum of squares(column-loading) = 1
        
            ------------------------------------------------------
                Variable |    Comp1     Comp2     Comp3     Comp4
            -------------+----------------------------------------
                headroom |   0.4446    0.7399    0.4944   -0.1022
                   trunk |   0.5142    0.2423   -0.7596    0.3160
                  length |   0.5328   -0.3755   -0.0732   -0.7548
            displacement |   0.5041   -0.5028    0.4161    0.5656
            ------------------------------------------------------
        
        . ssc install crossplot
        
        
        . crossplot (PC1) headroom trunk length displacement
        
        . ssc install cpcorr
        
        
        . cpcorr  headroom trunk length displacement \ PC1
        (obs=74)
        
                         PC1
            headroom  0.7601
               trunk  0.8789
              length  0.9108
        displacement  0.8617

        Here's the graph:
        [Graph: crossplot.png, PC1 plotted against headroom, trunk, length, and displacement]




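        The correlations that cpcorr reports can be recovered by hand from the pca output above: each is the PC1 eigenvector element times the square root of the PC1 eigenvalue. A quick arithmetic check in Python (numbers copied from the listing; agreement is to rounding):

```python
import math

eigval_pc1 = 2.92212
loadings = {"headroom": 0.4446, "trunk": 0.5142,
            "length": 0.5328, "displacement": 0.5041}
cpcorr_out = {"headroom": 0.7601, "trunk": 0.8789,
              "length": 0.9108, "displacement": 0.8617}

for var, load in loadings.items():
    r = load * math.sqrt(eigval_pc1)
    print(f"{var:>12s}  {r:.4f}  (cpcorr: {cpcorr_out[var]:.4f})")
```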
        To me this says: Don't use PC1, use length. It conveys almost the same information and is easier to think and write about.

        More generally, if you have a group of closely related variables, consider just choosing one as a predictor. If anyone wants to suggest that this advice does not hinge on using or thinking about PCA, I agree.

        The question otherwise is: If you just look at the relationships between the original variables using correlate and graph matrix, is it quite so obvious which variable is nearest the main pattern?

        In this example, a more subtle analysis would think about more dimensions, as two variables are dimensionally length and two dimensionally volume, but that nuance is orthogonal to my main purpose now.
