  • What independent variables to include in the PCA?

    So I am using panel data and plan to use a fixed-effect model, but some of my independent variables are strongly correlated. My equation is specified as follows:
    ln EXP_ijt = β0 + β1 ln GDP_it + β2 ln GDP_jt + β3 ln POP_it + β4 ln POP_jt
               + β5 ln DIST_ij + β6 ln TEL_it + β7 ln TEL_jt + β8 ln MOB_it + β9 ln MOB_jt
               + β10 ln BRO_it + β11 ln BRO_jt + β12 ln INT_it + β13 ln INT_jt
               + β14 BOR_ij + β15 Lang_ij + β16 RTA_ijt

    The ICT variables I use are strongly collinear, so I plan to use PCA to reduce/remove the multicollinearity. My question is: should I include all of my independent variables when I run the PCA, or only the correlated variables (i.e., the ICT variables), and then predict an index from them to replace the correlated variables before running the regression?

    Or in other words, can I do this:

    run a PCA on TEL_it, MOB_it, BRO_it, INT_it, then predict PC1 (assuming the results retain only the first component)
    run another PCA on TEL_jt, MOB_jt, BRO_jt, INT_jt, then predict its score

    The two scores/indexes I predict would then substitute for the correlated ICT variables, and I would run the FE regression afterward, as in the sketch below. Also, should I transform the scores/indexes into log form to be consistent with the other control variables in my model?
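
    To make this concrete, here is roughly what I mean in Stata (the variable and panel-identifier names below are just placeholders for my actual data):

    Code:
    pca lnTEL_i lnMOB_i lnBRO_i lnINT_i        // exporter-side (i) ICT variables
    predict ict_index_i                        // score on the first component (the default statistic)
    pca lnTEL_j lnMOB_j lnBRO_j lnINT_j        // importer-side (j) ICT variables
    predict ict_index_j
    xtset pairid year                          // country-pair and time identifiers
    xtreg lnEXP lnGDP_i lnGDP_j lnPOP_i lnPOP_j ict_index_i ict_index_j, fe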

    Thank you in advance!
    Last edited by AC Morante; 15 Apr 2022, 03:04.

  • #2
    My guess is that you shouldn't do a PCA at all.

    In a regression, there are two kinds of variables on the right-hand side of the equation. There are the key variables of interest--the ones whose effects you actually want to estimate. These are why you are doing the regression in the first place. Then there are covariates, which are not of interest in their own right but are included to remove their nuisance effects and reduce omitted-variable bias. These are often called "control variables." The impact of multicollinearity depends on which kind of variables are involved.

    If only the covariates are involved in a multicollinear relationship, this has no consequences at all for the estimation of the key variables' effects. Consequently, it is a waste of time and effort to do anything at all about it. Just leave the variables as they are and ignore their coefficients in the output.

    If the multicollinearity involves one of the key variables of interest, then you may have a problem on your hands. But PCA isn't going to solve it. If you do the PCA just among the involved covariates and exclude the key variable(s) from the PCA, then the resulting components, though now orthogonal to each other, will still be highly collinear with the key variable(s), and you will have accomplished nothing. Now, if you instead use the PCA to create a single index variable (the first component) and use only that in the regression, there are two possibilities. That first component may still be highly collinear with the key variable(s)--in which case it will accomplish little or nothing. In fact, if that first component is a "good index" in the sense of capturing a large proportion of the variance of the entire group of variables used in the PCA, it will be highly correlated with the key variable(s)--to about the same extent as the original variables were. If, however, it captures only a modest proportion of the PCA variables' variance, then it may well show little correlation with the key variable(s). But in that case it will be inadequate for reducing omitted-variable bias--that is, it will defeat the purpose of having those covariates in the model in the first place.
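
    To illustrate the first of those possibilities with made-up data (just a toy sketch, not your model), notice that the first component of a set of covariates that are all collinear with a key variable x remains highly correlated with x:

    Code:
    clear
    set obs 1000
    set seed 2022
    gen x  = rnormal()               // key variable of interest
    gen c1 = x + 0.3*rnormal()       // covariates collinear with x
    gen c2 = x + 0.3*rnormal()
    gen c3 = x + 0.3*rnormal()
    pca c1 c2 c3
    predict pc1                      // first-component score
    correlate x pc1                  // still very highly correlated with x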

    So if using only the covariates in the PCA is not useful, what about including the key variable(s) themselves? Well, that will give you a new set of orthogonal variables, and the collinearity will be gone. But now your key variables are no longer in the model: rather, they have been mangled into the components and are mixed in with the covariates (and, if we are talking about multiple key variables, they are now blended with each other as well). So you can't actually say anything about their effects, which defeats the purpose of doing the model in the first place.

    Arthur Goldberger's econometrics textbook should be required reading for everybody who does regression analysis. His chapter on multicollinearity is the definitive work on the subject. The bottom line is that multicollinearity would more properly be called micronumerosity: the only real solution to its effects is to get a larger data set, usually a much larger one.

    So I would just forget about it. You know you have multicollinearity, but check whether you actually have a multicollinearity problem. That is the case only if the confidence intervals for the coefficients of the key variables (the coefficients of the covariates, remember, don't matter) are so wide that your study is inconclusive; then you have a problem and need a larger data set. If, however, your confidence intervals are narrow enough that the compatible effect sizes are either all in a range that is meaningful, or all in a range that is not meaningful, then your study reaches a conclusion either way, and there is no problem from the multicollinearity.



    • #3
      Thank you for answering!

      But could you help me further in identifying whether the confidence intervals for the coefficients of my key variables are narrow or wide? Is there a command in Stata to plot the C.I.s after xtreg?
      Also, I have already increased my data set; however, some key variables are "repetitive," since for every importing country (represented by subscript j) that I add, I enter the same values (those are the key variables with subscript i, which represent the ICT variables of a particular exporting country).



      • #4
        The -coefplot- command graphs the coefficients from a regression and provides error-bars corresponding to the confidence intervals. I have not used it much myself, but it seems a very popular way for people to show regression results. You can find instructions at -help coefplot-.
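
        To give a sense of what that looks like (the variable names here are purely illustrative), something along these lines:

        Code:
        * ssc install coefplot           // if not already installed
        xtreg lnEXP lnGDP_i lnGDP_j lnPOP_i lnPOP_j lnTEL_i lnTEL_j, fe
        coefplot, drop(_cons) xline(0)   // coefficients with confidence-interval error bars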

        As for identifying whether the confidence intervals for the coefficients of your key variables are narrow or wide, that is a judgment call that you need to make based not on statistical criteria but on substance. So I can't help you specifically do that for your results. But here's an illustration of the approach. Suppose I am evaluating the effect of an intervention on some outcome, and in the units used in the regression, I decide that a difference in outcome of 2.5 units is large enough to matter for practical purposes. (For example, that amount of difference might be enough to justify the cost of the intervention or outweigh its side effects, or something like that.) If I have a confidence interval like (2.6, 3.6), then the entire confidence interval lies in "large enough" territory. If I have an equally wide confidence interval of (1.3, 2.3) then, even though this corresponds to a "statistically significant" result, the whole thing lies in "not large enough" territory. If I have a confidence interval like (2.0, 3.0), then my study is inconclusive because the interval straddles the threshold of "large enough" and lies in both "large enough" and "not large enough" territory. So the whole thing boils down to identifying a threshold effect that is "large enough to matter for practical purposes." And "large enough to matter for practical purposes" depends on the specific context of your study and is, in the end, a judgment call. Then you just see where your confidence interval lies with respect to that threshold. In a sense it is similar to assessing "statistical significance," but the reference point is not zero, it is a threshold of practical importance (which will rarely be zero in real applications.)
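
        If you prefer to read the bounds off numerically rather than from a graph, one way (shown here with a hypothetical variable x1 and a hypothetical threshold of 2.5) is to pull them out of r(table), which the estimation command leaves behind:

        Code:
        xtreg y x1 x2, fe
        matrix T = r(table)                        // rows include b, ll, and ul for each coefficient
        display "95% CI for x1: (" T["ll","x1"] ", " T["ul","x1"] ")"
        display "entire CI above 2.5? " (T["ll","x1"] > 2.5)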



        • #5
          Another thing to keep in mind, in concert with Clyde Schechter's advice, is the units your variable is measured in, along with other contextual factors. I'll give a personal example.

          For a paper I'm doing on mass vaccination events (likely for American Journal of Public Health, American Statistician, or Statistics in Medicine), my average treatment effect is about -40 COVID-19 cases per 100k one month after the event happened. Now, that may not seem like much on its own, but once you consider that this amounts to tens of thousands of fewer sick people in the areas in question, -40 doesn't seem so bad, considering that this is only one month after the intervention. And anyway, given COVID's contagiousness, maybe this is a meaningful effect size for what amount to county-wide events that happen for only one day.

          Equally important is the time scale you're concerned with. Sometimes folks get hung up on not seeing an immediate effect of their intervention. But when we look at the longer term, a month out, cases are, on average, 100 per 100k lower than they'd otherwise be, compared to places that didn't have a mass vaccination event. For example, I found that places that held the event earlier had a bigger treatment effect relative to those that didn't.

          Your CIs deserve the same care. People still get hung up on their results being "barely" significant (or not), and, so focused on that, they ignore that the treatment had a pretty big effect in some places and not in others: effect-size heterogeneity.

          Either way, stats is hard work, and the software is usually only half the battle (unless you write your own ado code); so long as you interpret your results with care and have a nuanced approach, you'll be fine.

