Interpretation of Interaction in Principal Components Regression

Carrie Dolan

Join Date: Jan 2016
Posts: 40

14 Sep 2018, 05:23

I am using Stata v15.

My specific question: I am not sure how to interpret the interaction terms in my regression when the factor loadings are a mix of positive and negative values.

The analysis is outlined below.

First, I standardized the variables used in the PCA as follows:

Code:

*Standardize variables (mean 0, SD 1)
foreach v of var u5m_2003 u5m_2008 u5m_2013 neonatal_2003 neonatal_2008 neonatal_2013 u5dia_2003 u5dia_2008 u5dia_2013 u5ari_2003 u5ari_2008 u5ari_2013 u5fever_2003 u5fever_2008 u5fever_2013 u5ft_2003 u5ft_2008 u5ft_2013 vaccp_2003 vaccp_2008 vaccp_2013 sba_2003 sba_2008 sba_2013 anc_2003 anc_2008 anc_2013 postnatal_2003 postnatal_2008 postnatal_2013 u5net_2003 u5net_2008 u5net_2013 {
    * summarize leaves the mean and SD in r(mean) and r(sd)
    qui su `v'
    g double z_`v' = (`v' - r(mean))/r(sd)
}
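
As a quick sanity check on the loop above, each z_ variable should come out with mean 0 and standard deviation 1. This is a minimal sketch, assuming the loop has already been run; zchk_u5m_2003 is a hypothetical name used only for the comparison. Note also that factor analyzes the correlation matrix by default, so the factor results should not depend on this standardization step.

Code:

* Each standardized variable should show mean 0 and SD 1
summarize z_*

* egen's std() function does the same job, one variable at a time
egen double zchk_u5m_2003 = std(u5m_2003)

* Any differences reported here should be at floating-point level only
compare z_u5m_2003 zchk_u5m_2003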

Second, I ran the PCA on the baseline (2003) variables as follows:

Code:

* Principal-component factor analysis of the baseline (2003) variables
factor z_u5m_2003 z_neonatal_2003 z_u5dia_2003 z_u5ari_2003 z_u5fever_2003 z_u5ft_2003 z_vaccp_2003 z_sba_2003 z_anc_2003 z_postnatal_2003 z_u5net_2003, pcf
rotate
predict factor1_2003 factor2_2003 factor3_2003

The resulting output is:

Code:

. factor z_u5m_2003 z_neonatal_2003 z_u5dia_2003 z_u5ari_2003 z_u5fever_2003 z_u5ft_2003 z_vaccp_2003 z_sba_2003 z_anc_2003 z_postnatal_2003 z_u5net_2003, pcf
(obs=80)

Factor analysis/correlation                      Number of obs    =         80
    Method: principal-component factors          Retained factors =          3
    Rotation: (unrotated)                        Number of params =         30

    Beware: solution is a Heywood case
            (i.e., invalid or boundary values of uniqueness)

    --------------------------------------------------------------------------
         Factor  |   Eigenvalue   Difference        Proportion   Cumulative
    -------------+------------------------------------------------------------
        Factor1  |      5.93522      2.32212            0.5396       0.5396
        Factor2  |      3.61309      2.16140            0.3285       0.8680
        Factor3  |      1.45169      1.45169            0.1320       1.0000
        Factor4  |      0.00000      0.00000            0.0000       1.0000
        Factor5  |      0.00000      0.00000            0.0000       1.0000
        Factor6  |      0.00000      0.00000            0.0000       1.0000
        Factor7  |     -0.00000      0.00000           -0.0000       1.0000
        Factor8  |     -0.00000      0.00000           -0.0000       1.0000
        Factor9  |     -0.00000      0.00000           -0.0000       1.0000
       Factor10  |     -0.00000      0.00000           -0.0000       1.0000
       Factor11  |     -0.00000            .           -0.0000       1.0000
    --------------------------------------------------------------------------
    LR test: independent vs. saturated:  chi2(55) =       . Prob>chi2 =      .

Factor loadings (pattern matrix) and unique variances

    -----------------------------------------------------------
        Variable |  Factor1   Factor2   Factor3 |   Uniqueness
    -------------+------------------------------+--------------
      z_u5m_2003 |   0.9645    0.1389   -0.2248 |     -0.0000  
    z_neona~2003 |  -0.2173    0.9517    0.2169 |      0.0000  
    z_u5dia_2003 |   0.9914    0.0754    0.1068 |     -0.0000  
    z_u5ari_2003 |   0.6885    0.2352    0.6860 |      0.0000  
    z_u5fev~2003 |   0.9777    0.1643   -0.1305 |     -0.0000  
     z_u5ft_2003 |  -0.2040   -0.8579    0.4716 |      0.0000  
    z_vaccp_2003 |  -0.8972    0.1648   -0.4096 |      0.0000  
      z_sba_2003 |  -0.9178    0.3289    0.2225 |      0.0000  
      z_anc_2003 |  -0.9006    0.1924    0.3898 |      0.0000  
    z_postn~2003 |   0.2018    0.9076    0.3680 |      0.0000  
    z_u5net_2003 |   0.0641   -0.9316    0.3577 |     -0.0000  
    -----------------------------------------------------------

. rotate

Factor analysis/correlation                      Number of obs    =         80
    Method: principal-component factors          Retained factors =          3
    Rotation: orthogonal varimax (Kaiser off)    Number of params =         30

    Beware: solution is a Heywood case
            (i.e., invalid or boundary values of uniqueness)

    --------------------------------------------------------------------------
         Factor  |     Variance   Difference        Proportion   Cumulative
    -------------+------------------------------------------------------------
        Factor1  |      5.21423      1.71609            0.4740       0.4740
        Factor2  |      3.49815      1.21053            0.3180       0.7920
        Factor3  |      2.28762            .            0.2080       1.0000
    --------------------------------------------------------------------------
    LR test: independent vs. saturated:  chi2(55) =       . Prob>chi2 =      .

Rotated factor loadings (pattern matrix) and unique variances

    -----------------------------------------------------------
        Variable |  Factor1   Factor2   Factor3 |   Uniqueness
    -------------+------------------------------+--------------
      z_u5m_2003 |  -0.9571    0.1870    0.2212 |     -0.0000  
    z_neona~2003 |   0.3735    0.8760    0.3051 |      0.0000  
    z_u5dia_2003 |  -0.8567    0.0487    0.5136 |     -0.0000  
    z_u5ari_2003 |  -0.3359    0.0707    0.9392 |      0.0000  
    z_u5fev~2003 |  -0.9296    0.1900    0.3158 |     -0.0000  
     z_u5ft_2003 |   0.2925   -0.9435    0.1557 |      0.0000  
    z_vaccp_2003 |   0.6734    0.2548   -0.6940 |      0.0000  
      z_sba_2003 |   0.9574    0.2687   -0.1057 |      0.0000  
      z_anc_2003 |   0.9950    0.0973    0.0212 |      0.0000  
    z_postn~2003 |   0.0461    0.7983    0.6005 |      0.0000  
    z_u5net_2003 |  -0.0044   -0.9890    0.1478 |     -0.0000  
    -----------------------------------------------------------

Factor rotation matrix

    -----------------------------------------
                 | Factor1  Factor2  Factor3
    -------------+---------------------------
         Factor1 | -0.9138  -0.0000   0.4062
         Factor2 |  0.0937   0.9730   0.2107
         Factor3 |  0.3953  -0.2306   0.8891
    -----------------------------------------

. predict factor1_2003 factor2_2003 factor3_2003
(regression scoring assumed)

Then I used the predicted values in my regression as follows:
(Note: I'm using only the first two factors and not the third, because together they explain most of the variance, about 79% after rotation.)

Code:

reg WB_commit i.year crossover##(c.factor1_2003 c.factor2_2003), cluster(n_region)

and the resulting output is:

Code:

Linear regression                               Number of obs     =         80
                                                F(2, 3)           =          .
                                                Prob > F          =          .
                                                R-squared         =     0.7753
                                                Root MSE          =     1.3e+06

                                           (Std. Err. adjusted for 4 clusters in n_region)
------------------------------------------------------------------------------------------
                         |               Robust
               WB_commit |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------------------+----------------------------------------------------------------
                    year |
                   1996  |   13933.84   8150.392     1.71   0.186    -12004.35    39872.02
                   1997  |   -9955.19   11737.44    -0.85   0.459    -47308.95    27398.57
                   1998  |   -7415.08   12926.08    -0.57   0.606    -48551.65    33721.49
                   1999  |   -9955.19   11737.44    -0.85   0.459    -47308.95    27398.57
                   2000  |   480178.9   136738.8     3.51   0.039      45015.2    915342.7
                   2001  |   993111.9     176355     5.63   0.011     431871.6     1554352
                   2002  |    2166282   629478.6     3.44   0.041     163000.4     4169564
                   2003  |   954529.3   275346.2     3.47   0.040     78254.79     1830804
                   2004  |  -7535.359   11754.28    -0.64   0.567    -44942.72       29872
                   2005  |    1625435   471472.5     3.45   0.041     124999.3     3125871
                   2006  |    1645258   826210.1     1.99   0.141    -984111.2     4274627
                   2007  |   470547.2   52232.61     9.01   0.003     304319.7    636774.7
                   2008  |    3983686   580806.6     6.86   0.006      2135301     5832072
                   2009  |    6977445    2099152     3.32   0.045     297007.4    1.37e+07
                   2010  |  -9955.193   328250.6    -0.03   0.978     -1054595     1034685
                   2011  |    1663598   277110.5     6.00   0.009     781708.6     2545487
                   2012  |    5194180    2057367     2.52   0.086     -1353281    1.17e+07
                   2013  |  -5587.206   331510.7    -0.02   0.988     -1060602     1049428
                   2014  |  -1058.305     330192    -0.00   0.998     -1051877     1049760
                         |
             1.crossover |          0  (omitted)
            factor1_2003 |  -229838.4    62332.2    -3.69   0.035    -428207.3   -31469.53
            factor2_2003 |   963.0723   57286.32     0.02   0.988    -181347.6    183273.7
                         |
crossover#c.factor1_2003 |
                      1  |  -467120.3   439140.7    -1.06   0.365     -1864662    930421.5
                         |
crossover#c.factor2_2003 |
                      1  |   64664.85   403591.6     0.16   0.883     -1219744     1349074
                         |
                   _cons |   9955.188   150387.5     0.07   0.951    -468645.1    488555.5
------------------------------------------------------------------------------------------

Returning to my specific question: how do I interpret the coefficient for crossover#c.factor1_2003 when some of the factor loadings are negative? According to the literature on factor score indeterminacy, the signs are arbitrary, but I'm not sure how this translates to the interpretation of the coefficient. Some researchers recode the factors to avoid negative values, but my data aren't on a Likert scale. Is it correct to say: "Although not statistically significant, the results indicate that aid levels are lower as need increases"?

Thanks in advance for your time.

Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

14 Sep 2018, 09:11

Is it correct to say: "Although not statistically significant, the results indicate that aid levels are lower as need increases"?

You need to explore the meaning of the components you have created. There is no way to tell from what you've shown.

As you note, the components are indeterminate, and a solution that reverses all the signs is just as valid as the one you have. What you can say is that when crossover = 0, a unit increase in factor1 is associated with an expected decrease in WB_commit, and when crossover = 1, it is associated with an even greater expected decrease. But what you have to resolve here is whether an increase in factor1 represents an increase in need, a decrease in need, or neither. Given the mix of positive and negative loadings of the original items on factor1, you have to put the semantics of all of those individual variables together to see if you can come up with a semantic interpretation of factor1. It may be that the best interpretation of factor1 is that it reflects the difference between different aspects of "need" (assuming that each of the individual variables is some aspect of need). Or it may be just some arbitrary and uninterpretable variable about which little of substance can be said.
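
To make the two slopes concrete, here is a minimal sketch, assuming the regression from #1 and the predicted scores are in memory. The slope of factor1 when crossover = 1 is the sum of the main-effect and interaction coefficients (from the posted output, -229838.4 + -467120.3, about -696959), and lincom gives that sum with a standard error. The second part shows the sign indeterminacy in practice: negating the factor flips the sign of both coefficients and changes nothing else about the fit. need1_2003 is a hypothetical name.

Code:

* Re-fit the model from #1
reg WB_commit i.year crossover##(c.factor1_2003 c.factor2_2003), cluster(n_region)

* Slope of factor1 when crossover = 1: main effect plus interaction
lincom factor1_2003 + 1.crossover#c.factor1_2003

* Reverse the sign of the factor (hypothetical variable name) and re-fit;
* both factor1 coefficients change sign, the fit is otherwise identical
gen double need1_2003 = -factor1_2003
reg WB_commit i.year crossover##(c.need1_2003 c.factor2_2003), cluster(n_region)

Which orientation to report is a matter of labelling: pick the one in which higher values correspond to whatever the loadings suggest "more need" means, if such a reading is defensible at all.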

Why did you do a principal components analysis in the first place? If there is no compelling reason for this, and if the resulting components are not interpretable in a substantive way, you might be better off just going back to the original variables. Or, if the idea was to try to reduce the dimensionality of the predictor space from 11 to 2, you might have better luck with an oblique rotation.
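
For reference, an oblique rotation can be requested directly after factor. This is a sketch only, assuming the standardized 2003 variables from #1; the local macro base2003 and the pfactor names are hypothetical. Promax is one oblique option among several, and it is not claimed here to be the best choice for these data.

Code:

* Collect the baseline variables in a local macro (hypothetical name)
local base2003 z_u5m_2003 z_neonatal_2003 z_u5dia_2003 z_u5ari_2003 z_u5fever_2003 z_u5ft_2003 z_vaccp_2003 z_sba_2003 z_anc_2003 z_postnatal_2003 z_u5net_2003

factor `base2003', pcf

* Oblique promax rotation instead of the default orthogonal varimax
rotate, promax

* Correlation matrix of the rotated common factors
estat common

* Scores based on the rotated solution (hypothetical variable names)
predict pfactor1_2003 pfactor2_2003 pfactor3_2003

After an oblique rotation the factors are allowed to correlate, so the variance attributed to each factor no longer adds up to a simple total.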

Finally, though you do not ask about it, I note that you are using the cluster robust vce with only 4 clusters. The cluster robust vce is not valid when the number of clusters is too small. There is no consensus about a minimum number, but I don't know of anybody who would accept n = 4 clusters.
Carrie Dolan

Join Date: Jan 2016

Posts: 40
#3

14 Sep 2018, 11:21

Thank you for the response. The reason for doing the PCA was to create a summary measure of need based on a broad framework of maternal and child health indicators.

I went back and reviewed my output in light of your response. An important limitation of my first analysis is that the PCA is a Heywood case. Because of the small sample size, the PCA really couldn't support the number of inputs in a meaningful way. Therefore, I reduced the inputs to only the "main" maternal and child health indicators (i.e., death, diarrhea, upper respiratory infection) and dropped the proximate determinants (i.e., service utilization indicators). Doing so made it easier to develop a semantic interpretation of the individual variables taken together (as described in the previous post): Factor 1 reflects under-5 (U5) need and Factor 2 reflects infant need.
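
A rough sketch of that revision follows. The exact variable list is not given in the thread, so the subset below is a guess, and u5need_2003 and infneed_2003 are hypothetical names. The first command is an optional check that rests on an assumption not confirmed anywhere above: if the 2003 baseline values are constant within region, the 80 observations contain only a handful of distinct profiles, which would be consistent with only three non-zero eigenvalues in #1.

Code:

* How many distinct baseline profiles are there among the 80 observations?
duplicates report z_u5m_2003 z_neonatal_2003 z_u5dia_2003 z_u5ari_2003

* Factor analysis on the reduced set of "main" indicators (guessed subset),
* retaining at most two factors
factor z_u5m_2003 z_neonatal_2003 z_u5dia_2003 z_u5ari_2003, pcf factors(2)
rotate

* Scores for the two retained factors (hypothetical variable names)
predict u5need_2003 infneed_2003

If the output still warns of a Heywood case, the reduced model is not yet supported by the data either.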

Finally, thanks for pointing out the issue of using the cluster robust vce with only 4 clusters. For people who may read this post later, I found the following resources helpful:

1. Cameron, A. Colin, Jonah B. Gelbach, and Douglas L. Miller. "Bootstrap-based improvements for inference with clustered errors." The Review of Economics and Statistics 90.3 (2008): 414-427.
2. https://blogs.worldbank.org/impactev...er-of-clusters
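
The bootstrap approach discussed in the first reference can be tried in Stata with the user-written boottest command. This is a sketch under assumptions, not something run in this thread: it assumes boottest has been installed from SSC and that the regression from #1 is the model of interest. The reps() and seed() values are arbitrary.

Code:

* One-time install of the user-written command
ssc install boottest

* Re-fit the model from #1 with clustering on region
reg WB_commit i.year crossover##(c.factor1_2003 c.factor2_2003), cluster(n_region)

* Wild cluster bootstrap test that the factor1 slope is zero when crossover = 0
boottest factor1_2003, weighttype(webb) reps(9999) seed(12345) nograph

Webb weights are used here because, with 4 clusters, two-point Rademacher weights yield only 2^4 = 16 distinct bootstrap draws, while the six-point Webb distribution yields 6^4 = 1296. Even so, inference from four clusters remains fragile and is best reported with that caveat.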