
  • Partial Component Analysis - collinearity and postestimation

    Dear all,

    I am new to the forum and I would like to ask one question pertaining to the topic described above: Principal Component Analysis, not Partial; sorry for the mistake in the title. I detected some collinearity among my independent variables, so I turned to the PCA methodology to try to handle this issue. In my case I had three independent variables, and I ran the following command: pca (my dependent var) (the three independent vars). After that I obtained results consisting of three components. My question is: do those three components represent my three independent variables, corrected for collinearity? And if that is the case, should I then use the command predict pc1 pc2 pc3 to obtain scores for the three components, which I would then use in a regression? Can you please advise me whether my logic and line of work are correct? I have read many sources, but it is still a bit confusing to me how exactly I should implement this methodology in an event study. Thank you very much for your attention, and I wish you a nice day.
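
    For concreteness, the workflow I have in mind is roughly the following (a sketch only; y and x1-x3 are hypothetical stand-ins for my own variables):

    Code:
    pca y x1 x2 x3          // what I ran: dependent variable listed first
    predict pc1 pc2 pc3     // component scores (score is the default statistic after pca)
    regress y pc1 pc2 pc3   // then use the scores as regressors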

    Regards
    Rosen
    Last edited by Rosen Vasilev; 21 Jul 2016, 02:57.

  • #2
    The term "partial component analysis" you use appears to be a slip for principal component analysis. But more importantly, PCA doesn't make a distinction between dependent (response) and independent (predictor) variables at all, so mentioning a variable first does not flag it to the pca command as dependent. PCA is not a kind of regression in that sense. In Stata, and indeed elsewhere, the order of the variables is immaterial:

    Code:
    . sysuse auto
    (1978 Automobile Data)
    
    . pca weight trunk length
    
    Principal components/correlation                  Number of obs    =        74
                                                      Number of comp.  =         3
                                                      Trace            =         3
        Rotation: (unrotated = principal)             Rho              =    1.0000
    
        --------------------------------------------------------------------------
           Component |   Eigenvalue   Difference         Proportion   Cumulative
        -------------+------------------------------------------------------------
               Comp1 |      2.56955      2.18969             0.8565       0.8565
               Comp2 |      .379868       .32929             0.1266       0.9831
               Comp3 |     .0505775            .             0.0169       1.0000
        --------------------------------------------------------------------------
    
    Principal components (eigenvectors)
    
        ----------------------------------------------------------
            Variable |    Comp1     Comp2     Comp3 | Unexplained
        -------------+------------------------------+-------------
              weight |   0.5924   -0.4459    0.6710 |           0
               trunk |   0.5333    0.8413    0.0883 |           0
              length |   0.6039   -0.3055   -0.7362 |           0
        ----------------------------------------------------------
    
    . pca trunk length weight
    
    Principal components/correlation                  Number of obs    =        74
                                                      Number of comp.  =         3
                                                      Trace            =         3
        Rotation: (unrotated = principal)             Rho              =    1.0000
    
        --------------------------------------------------------------------------
           Component |   Eigenvalue   Difference         Proportion   Cumulative
        -------------+------------------------------------------------------------
               Comp1 |      2.56955      2.18969             0.8565       0.8565
               Comp2 |      .379868       .32929             0.1266       0.9831
               Comp3 |     .0505775            .             0.0169       1.0000
        --------------------------------------------------------------------------
    
    Principal components (eigenvectors)
    
        ----------------------------------------------------------
            Variable |    Comp1     Comp2     Comp3 | Unexplained
        -------------+------------------------------+-------------
               trunk |   0.5333    0.8413    0.0883 |           0
              length |   0.6039   -0.3055   -0.7362 |           0
              weight |   0.5924   -0.4459    0.6710 |           0
        ----------------------------------------------------------
    There may be some use for PCA in looking at the structure of collinearity among predictors, but I would almost always use regression directly. First-round regression results (e.g. coefficient tables, added-variable plots) help you see whether a predictor could be omitted without loss. The decision should typically be substantive as much as statistical: e.g. omit a variable that seems incidental rather than central in meaning, or omit a predictor that is more difficult to measure or explain, if you have a choice. Almost any regression of interest involves predictors that are correlated substantially with each other: outside of an experimental design, it's hard to imagine how that might not happen. So collinearity is not at all the bugbear that some texts or courses make it out to be.
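
    To make that concrete, here is a minimal sketch of such a first-round look, using the same auto data (estat vif and avplots are standard postestimation commands after regress):

    Code:
    sysuse auto, clear
    regress price weight trunk length   // deliberately correlated predictors
    estat vif                           // variance inflation factors quantify the collinearity
    avplots                             // added-variable plots, one per predictor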

    But it's arguably quite wrong to include the response (dependent variable) in a call to pca and then use the resulting PCs in a regression, as that entails an element of circular reasoning.

    • #3
      Hi Nick, first of all thank you for the detailed answer and the example you showed me. The thing with my research is that I have to include all three dummy variables, since my main research question is based on them. In my case, these dummies represent three periods (let's say 1999-2000, 2001-2004, and 2005-2010). What my results show when I run a linear regression is that one of the variables always gets omitted due to collinearity. I personally could live with that, or could try to omit the variable that is highly correlated, but the research would suffer in that case, and when I present my results they might not be found substantial by the people who will examine them. I simply can't say in my research that one of the dummies is omitted due to high correlation and that the results from the omitted dummy do not make a difference. Hence, I am looking for any way to improve these variables and get the results I need. On that point, I would like to ask: after using the ''pca'' command and later the ''predict'' command to obtain scores for the components in question, are those components the very same dummy variables, corrected for collinearity? In addition, I found a post online which stated that when using the ''pca'' command the dependent variable should come first, followed by the independent ones, as I did: http://www.stata.com/features/overvi...al-components/. I am a bit confused with this procedure, to be honest. I haven't studied it, and that is why it is difficult for me to get into it.
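
      For concreteness, a minimal sketch of the situation I mean (the auto data and this 3-level grouping are only stand-ins for my periods):

      Code:
      sysuse auto, clear
      drop if missing(rep78)
      generate period = 1 + (rep78 > 2) + (rep78 > 4)   // hypothetical 3-level period variable
      tabulate period, generate(p)                      // creates the indicators p1, p2, p3
      regress price mpg p1 p2 p3                        // one of p1-p3 is omitted: they sum to 1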

      Regards
      Rosen

      • #4
        I simply can't say in my research that one of the dummies is omitted due to high correlation and that the results from the omitted dummy do not make a difference.
        The omission of one of the indicator ("dummy") variables does not mean that the results from that omitted category do not make a difference.

        Whenever you use indicator variables to represent the levels of a categorical variable, there is always collinearity. Each observation necessarily belongs to one and only one level of the categorical variable, so it takes on the value 1 for that indicator and 0 for all the others. Consequently the indicator variables always sum to 1 in every observation, and so one of the indicators is necessarily omitted. It is completely inappropriate to try to resolve this using principal components analysis, and the attempt will fail in the end anyway. If your concern is that you want to show either predicted outcomes or marginal effects for all levels of the variable in question, the omission of the base category is not an obstacle. The -margins- command will make the appropriate calculations if you use factor-variable notation (-help fvvarlist-) in your regression command. For example:

        Code:
        sysuse auto, clear
        regress price mpg i.rep78 // NOTE OMISSION OF ONE LEVEL OF rep78 IN OUTPUT
        margins rep78 // NOTE PREDICTED VALUES REPORTED AT ALL LEVELS OF rep78

        • #5
          I really can't advise you to use something you have not studied thoroughly, and getting advice through a forum can't make up for basic misunderstandings.

          Nor do I agree at all with your perception that you must or should use all three indicator (dummy) variables. That's quite wrong and Stata is utterly correct in omitting one of three variables in this circumstance. Any competent teacher or reviewer will understand that.

          I can't see that PCA will help you here at all, as in the case of (0, 1) variables there is really no extra structure that PCA can show you that isn't already summarized in the means and correlations.
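
          To illustrate with a minimal sketch (the auto data; the indicators here are only stand-ins for yours):

          Code:
          sysuse auto, clear
          drop if missing(rep78)
          tabulate rep78, generate(r)   // indicators r1-r5, one per level of rep78
          correlate r1-r5               // the correlation matrix of the indicators...
          pca r1-r5                     // ...is all that pca works from by default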

          The web page you cite from www.stata.com starts

          Stata’s pca allows you to estimate parameters of principal-component models.
          and I am going to say that StataCorp's wording is, in my view, not helpful here at all; I will suggest as much to them directly today.

          PCA is here, and everywhere, essentially a multivariate transformation. I don't think there is even a tacit model here, and (most crucially for you) if there is a model, then no variable is privileged or distinct as having a different status from the others. I have already demonstrated that the order of the variables you submit to a pca command is immaterial, for the same reason.

          That view on models is possibly a little contentious, as PCA could be described as a limiting, even degenerate, case of a factor-analysis model, but that is a more subtle issue and I don't think it has any bearing on your problem.

          • #6
            While I certainly agree that PCA is not what you should use (at least to the extent that I understand you), it is certainly possible, and sometimes reasonable, to have a model with an indicator (dummy) for every category and to suppress the constant instead. Note, however, that this changes the null hypothesis for each regression coefficient from a comparison with a reference group to a comparison with zero. Try the following (and compare it to what Clyde suggested):

            Code:
            sysuse auto, clear
            regress price mpg ibn.rep78, hascons
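
            As a possible follow-up (a sketch): with the constant suppressed, each rep78 coefficient is that group's own intercept and is tested against zero, but a comparison between two groups can still be recovered afterwards with -test-:

            Code:
            test 1.rep78 = 2.rep78   // does the rep78==1 intercept differ from the rep78==2 one?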

            • #7
              First of all, thank you all for your support and thorough answers. I will try to implement what you have told me, but in the meantime I think I have found a way to improve my data. Since I used to have three dummy variables signifying three different periods, I dissected them into three additional periods; hence, I now have six periods in total instead of three. The results so far show no collinearity, and they are more credible than before. I would also like to ask a question pertaining to the use of the command ''rreg'', which is for robust regression. When I compare ordinary least squares with robust regression, the results seem to differ. Do you know whether ''rreg'' is useful in many cases? In class we never used it, and I am a bit baffled as to what extent this command can help me in my research. I read about it here: http://www.ats.ucla.edu/stat/stata/dae/rreg.htm but I am still not quite sure what to do with it.
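
              For reference, a minimal sketch of the kind of comparison I mean (the auto data is a stand-in for my own):

              Code:
              sysuse auto, clear
              regress price weight mpg   // ordinary least squares
              rreg price weight mpg      // robust regression: iteratively downweights outlying observations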

              Kind Regards
              Rosen

              • #8
                I don't think changing the number of indicator (dummy) variables removes the collinearity. One of a set of mutually exclusive and exhaustive indicators will always be redundant whenever a constant is in the model.

                Search the forum for discussions of rreg.
