Help with PCA

Ashani Abayasekara

Join Date: May 2023
Posts: 106

09 Apr 2024, 01:23

Hi all,

I'm trying to conduct a principal component analysis (PCA) on eight indices relating to mental and physical health (four each). This produces two components with eigenvalues greater than 1. After varimax rotation, one component loads heavily on the psychological domains (mental health, vitality, role-emotional) and the other on the physical domains (physical functioning, role-physical, bodily pain). I then want to use the first component as a measure of mental health and analyse it further (for example, by looking at average mental health scores in specific years).

Is it correct to use the scores predicted for the first component (named 'sf1' in the code below) for further analysis of mental health? The indices entered into the PCA are coded so that higher values indicate better health, so can I assume the same of the predicted scores, that is, that higher values indicate better mental health?

Code:

. pca vitality so_function em_role men_health ph_function ph_role pain gen_health, mineigen(1)

Principal components/correlation                 Number of obs    =    110,393
                                                 Number of comp.  =          2
                                                 Trace            =          8
    Rotation: (unrotated = principal)            Rho              =     0.6776

    --------------------------------------------------------------------------
       Component |   Eigenvalue   Difference         Proportion   Cumulative
    -------------+------------------------------------------------------------
           Comp1 |      4.34624      3.27128             0.5433       0.5433
           Comp2 |      1.07496      .366809             0.1344       0.6776
           Comp3 |      .708147        .1829             0.0885       0.7662
           Comp4 |      .525248      .108535             0.0657       0.8318
           Comp5 |      .416712     .0438174             0.0521       0.8839
           Comp6 |      .372895     .0554139             0.0466       0.9305
           Comp7 |      .317481     .0791587             0.0397       0.9702
           Comp8 |      .238322            .             0.0298       1.0000
    --------------------------------------------------------------------------

Principal components (eigenvectors) 

    ------------------------------------------------
        Variable |    Comp1     Comp2 | Unexplained 
    -------------+--------------------+-------------
        vitality |   0.3799   -0.2486 |       .3064 
     so_function |   0.4010   -0.1392 |       .2802 
         em_role |   0.3235   -0.3743 |       .3945 
      men_health |   0.3545   -0.4980 |       .1871 
     ph_function |   0.3019    0.4935 |        .342 
         ph_role |   0.3529    0.3306 |       .3414 
            pain |   0.3452    0.4049 |       .3059 
      gen_health |   0.3601    0.1180 |       .4214 
    ------------------------------------------------

. rotate, varimax 

Principal components/correlation                 Number of obs    =    110,393
                                                 Number of comp.  =          2
                                                 Trace            =          8
    Rotation: orthogonal varimax (Kaiser off)    Rho              =     0.6776

    --------------------------------------------------------------------------
       Component |     Variance   Difference         Proportion   Cumulative
    -------------+------------------------------------------------------------
           Comp1 |      2.81438      .207557             0.3518       0.3518
           Comp2 |      2.60682            .             0.3259       0.6776
    --------------------------------------------------------------------------

Rotated components 

    ------------------------------------------------
        Variable |    Comp1     Comp2 | Unexplained 
    -------------+--------------------+-------------
        vitality |   0.4471    0.0787 |       .3064 
     so_function |   0.3877    0.1729 |       .2802 
         em_role |   0.4921   -0.0516 |       .3945 
      men_health |   0.5993   -0.1205 |       .1871 
     ph_function |  -0.1176    0.5665 |        .342 
         ph_role |   0.0311    0.4825 |       .3414 
            pain |  -0.0254    0.5315 |       .3059 
      gen_health |   0.1818    0.3325 |       .4214 
    ------------------------------------------------

Component rotation matrix

    ----------------------------------
                 |    Comp1     Comp2 
    -------------+--------------------
           Comp1 |   0.7292    0.6843 
           Comp2 |  -0.6843    0.7292 
    ----------------------------------

. predict sf1 sf2, score 

Scoring coefficients for orthogonal varimax rotation
    sum of squares(column-loading) = 1

    ----------------------------------
        Variable |    Comp1     Comp2 
    -------------+--------------------
        vitality |   0.4471    0.0787 
     so_function |   0.3877    0.1729 
         em_role |   0.4921   -0.0516 
      men_health |   0.5993   -0.1205 
     ph_function |  -0.1176    0.5665 
         ph_role |   0.0311    0.4825 
            pain |  -0.0254    0.5315 
      gen_health |   0.1818    0.3325 
    ----------------------------------

Here is a subset of my data for the predicted scores:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(sf1 sf2)
   1.581653 1.0866716
  1.1197572  .7934264
  1.0202587  .8034239
   .8457245   .806613
   .8329989  1.344683
   .4699488 .54378575
          .         .
-.003876648   .682789
  -.5739089 -2.499006
 -1.4357677 -2.774586
 -1.3355894 -3.725319
   -3.05685 -4.913719
  -2.529988 -3.770534
 -1.3062733 -3.011108
 -.57820594 -3.058407
          .         .
 -1.1788896 1.6541096
    1.49292  .4379246
  .56137186 .53877926
  -.3444249 -.0369712
end

Last edited by Ashani Abayasekara; 09 Apr 2024, 01:27.


Nick Cox

Join Date: Mar 2014
Posts: 35698

09 Apr 2024, 03:22

You're in tricky territory. For the unrotated solution, PC1 is essentially an overall average of all the measures: the coefficients are very close to one another! It's only when you rotate that you get closer to your interpretation. Reviewers might well be divided on the merits of what you're doing.
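
One way to check the "overall average" reading is to compare the unrotated PC1 scores with a plain average of the standardized indices. What follows is a minimal sketch, not a recommended workflow: it assumes the variable names from the question, and pc1, z_* and zmean are made-up names for illustration only.

```stata
* Sketch only: variable names are taken from the question;
* pc1, z_* and zmean are hypothetical names introduced here
pca vitality so_function em_role men_health ph_function ph_role pain gen_health, components(1)
predict pc1, score

* standardize each index on the estimation sample, then average across indices
foreach v of varlist vitality so_function em_role men_health ph_function ph_role pain gen_health {
    egen z_`v' = std(`v') if e(sample)
}
egen zmean = rowmean(z_*) if e(sample)

* a correlation close to 1 would mean unrotated PC1 behaves like a simple average
correlate pc1 zmean
```

Note that predict is run straight after pca here, before any rotate, so the scores are for the unrotated solution.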

I won't be a reviewer, but I rarely find that a PC that is a mishmash of the original variables helps interpretation as compared with using the original variables.

I find it helpful -- if I am looking at a PCA, although as implied I usually move on -- also to look at the correlations between the PCs and the original variables.

cpcorr from SSC is convenient for this, although correlate will serve too.
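
With the variables in the question, the correlate route might look like the sketch below. It assumes sf1 and sf2 already exist from the predict command shown in the question; nothing else is new.

```stata
* Sketch only: assumes sf1 and sf2 were created by
*   predict sf1 sf2, score
* after rotate, varimax, as in the question
correlate sf1 sf2 vitality so_function em_role men_health ph_function ph_role pain gen_health
```

The first two columns of the result hold the correlations of each index with sf1 and sf2. If the mental health indices all correlate positively with sf1, that supports reading higher sf1 as better mental health; a negative sign would mean the score runs the other way.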

Here is a silly example using the auto data. I loaded (pun intended) the PCA with 3 "size" variables out of 5 and (surprise!) they pop up related to PC1.

Code:

. sysuse auto, clear
(1978 automobile data)

. ds
make          price         mpg           rep78         headroom      trunk         weight        length        turn          displacement  gear_ratio    foreign

. pca trunk length weight mpg price

Principal components/correlation                 Number of obs    =         74
                                                 Number of comp.  =          5
                                                 Trace            =          5
    Rotation: (unrotated = principal)            Rho              =     1.0000

    --------------------------------------------------------------------------
       Component |   Eigenvalue   Difference         Proportion   Cumulative
    -------------+------------------------------------------------------------
           Comp1 |      3.58052      2.84488             0.7161       0.7161
           Comp2 |      .735641      .322635             0.1471       0.8632
           Comp3 |      .413006      .185495             0.0826       0.9458
           Comp4 |      .227511      .184184             0.0455       0.9913
           Comp5 |     .0433265            .             0.0087       1.0000
    --------------------------------------------------------------------------

Principal components (eigenvectors) 

    ------------------------------------------------------------------------------
        Variable |    Comp1     Comp2     Comp3     Comp4     Comp5 | Unexplained 
    -------------+--------------------------------------------------+-------------
           trunk |   0.4166   -0.4004    0.7703   -0.2585   -0.0765 |           0 
          length |   0.5001   -0.1990   -0.1567    0.4396    0.7018 |           0 
          weight |   0.5050   -0.0274   -0.2025    0.4602   -0.7011 |           0 
             mpg |  -0.4650    0.0208    0.5056    0.7265   -0.0049 |           0 
           price |   0.3243    0.8938    0.2922   -0.0207    0.1006 |           0 
    ------------------------------------------------------------------------------

. predict PC1
(score assumed)
(4 components skipped)

Scoring coefficients 
    sum of squares(column-loading) = 1

    ----------------------------------------------------------------
        Variable |    Comp1     Comp2     Comp3     Comp4     Comp5 
    -------------+--------------------------------------------------
           trunk |   0.4166   -0.4004    0.7703   -0.2585   -0.0765 
          length |   0.5001   -0.1990   -0.1567    0.4396    0.7018 
          weight |   0.5050   -0.0274   -0.2025    0.4602   -0.7011 
             mpg |  -0.4650    0.0208    0.5056    0.7265   -0.0049 
           price |   0.3243    0.8938    0.2922   -0.0207    0.1006 
    ----------------------------------------------------------------


. cpcorr trunk length weight mpg price \ PC1
(obs=74)

            PC1
 trunk   0.7884
length   0.9463
weight   0.9555
   mpg  -0.8798
 price   0.6136

On this evidence alone I might decide to use weight (not PC1) as my best single measure of size. That could be decided from the eigenvectors. But my point is that the correlations are often easier for readers to think about than the eigenvectors. Nothing stops you using all available evidence.


Ashani Abayasekara

Join Date: May 2023

Posts: 106
#3

09 Apr 2024, 16:10

Hi Nick,

Thank you very much for this explanation. It really helped me understand PCA better, along with the pros and cons of using it.

Just a quick follow-up question: when you say the decision to use 'weight' as the single best measure of size can also be decided from the eigenvectors (apart from the correlation values), is this because the value of weight for Comp1 in the eigenvector table is the highest at 0.5050?

Thanks,
Ashani
Nick Cox

Join Date: Mar 2014

Posts: 35698
#4

10 Apr 2024, 00:48

Yes. It's the largest coefficient -- in absolute value.
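
If you want to pick that out programmatically rather than by eye, something like this sketch would do it for the auto example. It relies on the eigenvectors that pca leaves behind in e(L); the names L, names, best and bestabs are just illustrative.

```stata
* Sketch only: find the variable with the largest absolute coefficient on Comp1
sysuse auto, clear
pca trunk length weight mpg price

matrix L = e(L)                 // eigenvectors: rows are variables, columns are components
local names : rownames L
local best ""
local bestabs = 0

forvalues i = 1/`=rowsof(L)' {
    local a = abs(L[`i', 1])    // absolute value, so a negative sign (as for mpg) does not matter
    if `a' > `bestabs' {
        local bestabs = `a'
        local best : word `i' of `names'
    }
}

display "largest |coefficient| on Comp1: `best' (" %6.4f `bestabs' ")"
```

Given the eigenvector table earlier in the thread, this should single out weight at 0.5050.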