pcacoefsave available from SSC (a utility for pca users)

Nick Cox

Join Date: Mar 2014
Posts: 35698

01 Jun 2015, 04:46

Thanks as usual to Kit Baum, a new package pcacoefsave may be downloaded from SSC using

Code:

ssc inst pcacoefsave

The aim is simple: to save certain key results from PCA, as obtained with the pca command, to a new dataset, which makes various kinds of tabulation and graphing much easier.

Stata 9 is required.

As commented in the help, the command does not extend to factor analysis. Users interested in factor analysis are likely to be using SEM anyway, and what they need or want may well be quite different and, in any case, beyond my experience.

Here is a simple example. We first throw various variables related to size in some sense from the auto data into a pca:

Code:

 
. sysuse auto, clear
(1978 Automobile Data)

. pca headroom trunk weight length displacement

Principal components/correlation                 Number of obs    =         74
                                                 Number of comp.  =          5
                                                 Trace            =          5
    Rotation: (unrotated = principal)            Rho              =     1.0000

    --------------------------------------------------------------------------
       Component |   Eigenvalue   Difference         Proportion   Cumulative
    -------------+------------------------------------------------------------
           Comp1 |      3.76201        3.026             0.7524       0.7524
           Comp2 |      .736006      .427915             0.1472       0.8996
           Comp3 |      .308091      .155465             0.0616       0.9612
           Comp4 |      .152627      .111357             0.0305       0.9917
           Comp5 |     .0412693            .             0.0083       1.0000
    --------------------------------------------------------------------------

Principal components (eigenvectors) 

    ------------------------------------------------------------------------------
        Variable |    Comp1     Comp2     Comp3     Comp4     Comp5 | Unexplained 
    -------------+--------------------------------------------------+-------------
        headroom |   0.3587    0.7640    0.5224   -0.1209    0.0130 |           0 
           trunk |   0.4334    0.3665   -0.7676    0.2914    0.0612 |           0 
          weight |   0.4842   -0.3329    0.0737   -0.2669    0.7603 |           0 
          length |   0.4863   -0.2372   -0.1050   -0.5745   -0.6051 |           0 
    displacement |   0.4610   -0.3390    0.3484    0.7065   -0.2279 |           0 
    ------------------------------------------------------------------------------

. pcacoefsave using pca_results
file pca_results.dta saved

. use pca_results

. describe 

Contains data from pca_results.dta
  obs:            25                          
 vars:             8                          1 Jun 2015 11:21
 size:           575                          
---------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------------------
varname         byte    %12.0g     names      variable
varlabel        byte    %22.0g     labels     variable
PC              byte    %8.0g                 
corr            float   %9.0g                 correlation
loading         float   %9.0g                 coefficient
eigenvalue      float   %9.0g                 
mean            float   %9.0g                 
SD              float   %9.0g                 standard deviation
---------------------------------------------------------------------------------------
Sorted by:

One thing I like to do, which doesn't seem common in texts or papers, is to look at the correlations between the original variables and the components. I don't need the correlations between the PCs because these are 1 or 0 by definition. After this new command, it's a straightforward tabulation:

Code:

 
. l PC varname corr 

     +-------------------------------+
     | PC        varname        corr |
     |-------------------------------|
  1. |  1       headroom    .6957921 |
  2. |  2       headroom    .6554101 |
  3. |  3       headroom    .2899519 |
  4. |  4       headroom   -.0472426 |
  5. |  5       headroom    .0026352 |
     |-------------------------------|
  6. |  1          trunk    .8405304 |
  7. |  2          trunk    .3144061 |
  8. |  3          trunk   -.4260833 |
  9. |  4          trunk    .1138242 |
 10. |  5          trunk    .0124329 |
     |-------------------------------|
 11. |  1         weight     .939158 |
 12. |  2         weight   -.2856239 |
 13. |  3         weight    .0409204 |
 14. |  4         weight   -.1042662 |
 15. |  5         weight    .1544515 |
     |-------------------------------|
 16. |  1         length    .9432383 |
 17. |  2         length   -.2035082 |
 18. |  3         length   -.0582883 |
 19. |  4         length   -.2244516 |
 20. |  5         length   -.1229222 |
     |-------------------------------|
 21. |  1   displacement    .8942441 |
 22. |  2   displacement   -.2908539 |
 23. |  3   displacement     .193391 |
 24. |  4   displacement    .2760232 |
 25. |  5   displacement   -.0462885 |
     +-------------------------------+

. tabdisp varname PC, cell(corr) format(%4.3f)

-----------------------------------------------------
             |                   PC                  
    variable |      1       2       3       4       5
-------------+---------------------------------------
    headroom |  0.696   0.655   0.290  -0.047   0.003
       trunk |  0.841   0.314  -0.426   0.114   0.012
      weight |  0.939  -0.286   0.041  -0.104   0.154
      length |  0.943  -0.204  -0.058  -0.224  -0.123
displacement |  0.894  -0.291   0.193   0.276  -0.046
-----------------------------------------------------
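As a check on the tabulation above, here is the standard relation for a PCA of the correlation matrix, sketched with the numbers already shown. The symbols r, a and lambda are my own notation for this note, not anything produced by pcacoefsave: r is the correlation between a variable and a component, a is the loading (eigenvector element), and lambda is the eigenvalue.

```latex
\begin{align*}
% correlation of standardised variable j with component k
r_{jk} &= a_{jk}\,\sqrt{\lambda_k} \\
% headroom on PC1, using the loading and eigenvalue from the pca output
r_{\text{headroom},1} &= 0.3587 \times \sqrt{3.76201}
                       \approx 0.3587 \times 1.9396 \approx 0.696 \\
% squared correlations within a component sum to its eigenvalue
\sum_{j} r_{jk}^{2} &= \lambda_k
\end{align*}
```

The value 0.696 agrees with the headroom entry for PC 1, and the squared correlations in the PC 1 column sum to about 3.762, the first eigenvalue.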

Another thing I often do is plot loadings in a particular way, as already documented in eofplot (SSC). Another way of getting that easily is to present the data as panel data:

Code:

 
. xtset PC varname
       panel variable:  PC (strongly balanced)
        time variable:  varname, 1 to 5
                delta:  1 unit

. xtline loading, overlay xla(, valuelabel) recast(connected) legend(pos(3) col(1)) yla(, ang(h))


Maria Ribeiro

Join Date: Apr 2015

Posts: 42
#2

16 Jun 2015, 07:43

Based on the above results, is it correct to say that PC1 captures most of the total sample variance? If so, could it be used in future analyses in place of these illustrative variables?

Thanks,
Maria
Nick Cox

Join Date: Mar 2014

Posts: 35698
#3

16 Jun 2015, 08:39

I assume you're referring to the example above. Here PC1 captures 75% of the total variance based on standardised variables. Objectively, that's "most", indeed.

More crucially, whether PC1 is an adequate substitute for the original variables is a substantive decision for researchers. I can readily imagine different researchers jumping either way, some saying "Not enough", some saying "That's fine".
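For reference, the 75% figure can be read straight off the pca output in the first post. With standardised variables the total variance (the trace) equals the number of variables, here 5, so the share for PC1 is its eigenvalue divided by 5. The symbol lambda for an eigenvalue is just notation for this note.

```latex
\begin{equation*}
% proportion of total variance captured by PC1
\frac{\lambda_1}{\sum_{k=1}^{5} \lambda_k}
  = \frac{3.76201}{5}
  \approx 0.7524
\end{equation*}
```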