

  • Help with PCA

    Hi all,

    I'm trying to conduct a principal component factor analysis on eight indices relating to mental and physical health (four each). This produces two factors (with eigenvalues greater than 1), where one factor loads heavily on psychological domains (mental health, vitality, role-emotional) and the other on physical domains (physical functioning, role-physical, bodily pain). I want to then use the first factor as a measure of mental health and perform further analysis on this (for example, look at average mental health scores in specific years).

    I want to clarify whether it is correct to use the values predicted from the first factor (named 'sf1' in the code below) for further analysis of the mental health measure? The indices used to perform PCA are such that higher values indicate better health, so can I assume the same for the predicted scores from the PCA, where higher values would indicate better mental health?

    . pca vitality so_function em_role men_health ph_function ph_role pain gen_health, mineigen(1)
    Principal components/correlation                 Number of obs    =    110,393
                                                     Number of comp.  =          2
                                                     Trace            =          8
        Rotation: (unrotated = principal)            Rho              =     0.6776
           Component |   Eigenvalue   Difference         Proportion   Cumulative
               Comp1 |      4.34624      3.27128             0.5433       0.5433
               Comp2 |      1.07496      .366809             0.1344       0.6776
               Comp3 |      .708147        .1829             0.0885       0.7662
               Comp4 |      .525248      .108535             0.0657       0.8318
               Comp5 |      .416712     .0438174             0.0521       0.8839
               Comp6 |      .372895     .0554139             0.0466       0.9305
               Comp7 |      .317481     .0791587             0.0397       0.9702
               Comp8 |      .238322            .             0.0298       1.0000
    Principal components (eigenvectors) 
            Variable |    Comp1     Comp2 | Unexplained 
            vitality |   0.3799   -0.2486 |       .3064 
         so_function |   0.4010   -0.1392 |       .2802 
             em_role |   0.3235   -0.3743 |       .3945 
          men_health |   0.3545   -0.4980 |       .1871 
         ph_function |   0.3019    0.4935 |        .342 
             ph_role |   0.3529    0.3306 |       .3414 
                pain |   0.3452    0.4049 |       .3059 
          gen_health |   0.3601    0.1180 |       .4214 
    . rotate, varimax 
    Principal components/correlation                 Number of obs    =    110,393
                                                     Number of comp.  =          2
                                                     Trace            =          8
        Rotation: orthogonal varimax (Kaiser off)    Rho              =     0.6776
           Component |     Variance   Difference         Proportion   Cumulative
               Comp1 |      2.81438      .207557             0.3518       0.3518
               Comp2 |      2.60682            .             0.3259       0.6776
    Rotated components 
            Variable |    Comp1     Comp2 | Unexplained 
            vitality |   0.4471    0.0787 |       .3064 
         so_function |   0.3877    0.1729 |       .2802 
             em_role |   0.4921   -0.0516 |       .3945 
          men_health |   0.5993   -0.1205 |       .1871 
         ph_function |  -0.1176    0.5665 |        .342 
             ph_role |   0.0311    0.4825 |       .3414 
                pain |  -0.0254    0.5315 |       .3059 
          gen_health |   0.1818    0.3325 |       .4214 
    Component rotation matrix
                     |    Comp1     Comp2 
               Comp1 |   0.7292    0.6843 
               Comp2 |  -0.6843    0.7292 
    . predict sf1 sf2, score 
    Scoring coefficients for orthogonal varimax rotation
        sum of squares(column-loading) = 1
            Variable |    Comp1     Comp2 
            vitality |   0.4471    0.0787 
         so_function |   0.3877    0.1729 
             em_role |   0.4921   -0.0516 
          men_health |   0.5993   -0.1205 
         ph_function |  -0.1176    0.5665 
             ph_role |   0.0311    0.4825 
                pain |  -0.0254    0.5315 
          gen_health |   0.1818    0.3325 

    Here is a subset of my data for the predicted scores:

    * Example generated by -dataex-. For more info, type help dataex
    input float(sf1 sf2)
       1.581653 1.0866716
      1.1197572  .7934264
      1.0202587  .8034239
       .8457245   .806613
       .8329989  1.344683
       .4699488 .54378575
              .         .
    -.003876648   .682789
      -.5739089 -2.499006
     -1.4357677 -2.774586
     -1.3355894 -3.725319
       -3.05685 -4.913719
      -2.529988 -3.770534
     -1.3062733 -3.011108
     -.57820594 -3.058407
              .         .
     -1.1788896 1.6541096
        1.49292  .4379246
      .56137186 .53877926
      -.3444249 -.0369712
    end
    Last edited by Ashani Abayasekara; 09 Apr 2024, 02:27.

  • #2
    You're in tricky territory. For the unrotated solution, PC1 is essentially an overall average of all measures: the coefficients are very close to each other. It's only when you rotate that you get closer to your interpretation. Reviewers might well be divided on the merits of what you're doing.

    I won't be a reviewer, but I rarely find that a PC that is a mishmash of the original variables helps interpretation as compared with using the original variables.

    When I do look at a PCA (although, as implied, I usually move on), I find it helpful also to look at the correlations between the PCs and the original variables.

    cpcorr from SSC is convenient for this, although correlate will serve too.

    Here is a silly example using the auto data. I loaded (pun intended) the PCA with 3 "size" variables out of 5 and (surprise!) they pop up related to PC1.

    . sysuse auto, clear
    (1978 automobile data)
    . ds
    make          price         mpg           rep78         headroom      trunk         weight        length        turn          displacement  gear_ratio    foreign
    . pca trunk length weight mpg price
    Principal components/correlation                 Number of obs    =         74
                                                     Number of comp.  =          5
                                                     Trace            =          5
        Rotation: (unrotated = principal)            Rho              =     1.0000
           Component |   Eigenvalue   Difference         Proportion   Cumulative
               Comp1 |      3.58052      2.84488             0.7161       0.7161
               Comp2 |      .735641      .322635             0.1471       0.8632
               Comp3 |      .413006      .185495             0.0826       0.9458
               Comp4 |      .227511      .184184             0.0455       0.9913
               Comp5 |     .0433265            .             0.0087       1.0000
    Principal components (eigenvectors) 
            Variable |    Comp1     Comp2     Comp3     Comp4     Comp5 | Unexplained 
               trunk |   0.4166   -0.4004    0.7703   -0.2585   -0.0765 |           0 
              length |   0.5001   -0.1990   -0.1567    0.4396    0.7018 |           0 
              weight |   0.5050   -0.0274   -0.2025    0.4602   -0.7011 |           0 
                 mpg |  -0.4650    0.0208    0.5056    0.7265   -0.0049 |           0 
               price |   0.3243    0.8938    0.2922   -0.0207    0.1006 |           0 
    . predict PC1
    (score assumed)
    (4 components skipped)
    Scoring coefficients 
        sum of squares(column-loading) = 1
            Variable |    Comp1     Comp2     Comp3     Comp4     Comp5 
               trunk |   0.4166   -0.4004    0.7703   -0.2585   -0.0765 
              length |   0.5001   -0.1990   -0.1567    0.4396    0.7018 
              weight |   0.5050   -0.0274   -0.2025    0.4602   -0.7011 
                 mpg |  -0.4650    0.0208    0.5056    0.7265   -0.0049 
               price |   0.3243    0.8938    0.2922   -0.0207    0.1006 
    . cpcorr trunk length weight mpg price \ PC1
     trunk   0.7884
    length   0.9463
    weight   0.9555
       mpg  -0.8798
     price   0.6136
    On this evidence alone I might decide to use weight (not PC1) as my best single measure of size. That could also be decided from the eigenvectors. But my point is that the correlations are often easier for readers to think about than the eigenvectors. Nothing stops you using all available evidence.
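
    The same check could be run on the health data from #1. What follows is only a sketch: it assumes that pca, rotate, varimax and predict sf1 sf2, score have already been run as shown there, so that sf1 and sf2 exist in memory. The local macro name health is made up for illustration.

    ```stata
    * Sketch only: assumes sf1 and sf2 were created by -predict- as in #1.
    * The macro name "health" is hypothetical shorthand for the eight indices.
    local health vitality so_function em_role men_health ph_function ph_role pain gen_health

    * Built-in route: the sf1 and sf2 columns show how each index relates to the scores.
    correlate sf1 sf2 `health'

    * Community-contributed route (install once with: ssc install cpcorr).
    cpcorr `health' \ sf1 sf2
    ```

    If the correlations between sf1 and the psychological indices are all positive, higher sf1 goes with higher values of those indices, which is what is needed to read sf1 as "higher means better mental health". The signs are worth checking rather than assuming, because the sign of a component is arbitrary.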


    • #3
      Hi Nick,

      Thank you very much for this explanation, it really helped me understand PCA better and the pros and cons of using it.

      Just a quick follow-up question: when you say the decision to use 'weight' as the single best measure of size can also be decided from the eigenvectors (apart from the correlation values), is this because the value of weight for Comp1 in the eigenvector table is the highest at 0.5050?



      • #4
        Yes. It's the largest coefficient -- in absolute value.
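
        The reason the two views agree can be written down. For an unrotated PCA of the correlation matrix, the correlation between variable j and component k is the eigenvector coefficient scaled by the square root of the eigenvalue. The symbols below are notation introduced here, not Stata output names: v_{jk} is the coefficient and \lambda_k the eigenvalue.

        ```latex
        \operatorname{corr}(x_j, \mathrm{PC}_k) = v_{jk}\,\sqrt{\lambda_k}
        % Comp1 in the auto example: \lambda_1 = 3.58052, so \sqrt{\lambda_1} \approx 1.8922
        % trunk:  0.4166 \times 1.8922 \approx 0.788
        % weight: 0.5050 \times 1.8922 \approx 0.956
        ```

        Within one component every coefficient is multiplied by the same factor, so ranking variables by absolute coefficient and ranking them by absolute correlation give the same order. The two worked values match the cpcorr results in #2 up to rounding of the displayed coefficients.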

