  • Principal component regression (PCR)

    Hello experts, I'm working with university rankings data. As we all know, the variables are highly correlated, e.g., acceptance rate and average test scores for admission. PCR seems to be the way to deal with multicollinearity in regression. I have read about PCR and now understand the logic and general steps, but I can't find a Stata example with code to do the analysis. Could anyone please help? Is there any source I could read?

    This task on rankings is from work, but I am personally highly interested in rankings. So I would love to understand the data in the best way possible. Thanks a million in advance!
    Last edited by ji zhou; 28 Aug 2014, 11:49.

  • #2
    I don't think there is anything that really needs documenting here. You run pca on some variables, get scores for the principal components with predict, and then use those score variables as predictors in regress. However, I think you are right that not much has been written on this in Stata, and I guess that's mostly because it's not often used.
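
    A minimal sketch of those steps, with made-up variable names (y for the outcome, x1-x4 for the correlated predictors):

        pca x1 x2 x3 x4              // principal components of the predictors
        predict pc1 pc2, score       // save scores for the first two components
        regress y pc1 pc2            // use the component scores as predictors

    How many components to keep is a judgment call; screeplot after pca shows the eigenvalues.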

    The bigger question is thus whether PCR really is the right way to proceed. Here six experienced researchers would give you six different opinions. Here's one: modern regression software is perfectly capable of handling collinearity; if two variables are effectively perfectly collinear, one will be omitted automatically; or you can exercise judgment and choose from a bundle of highly correlated predictors. PCR may solve the problem of collinearity, but it replaces it with another problem: greater difficulty in interpretation.



    • #3
      Thank you, Nick, for explaining the steps, which sound pretty doable. You are exactly right about interpretation, which is also one of my concerns. But I will give it a try and see what results I get.

      Previously, I tried regress with all the variables used to calculate the rankings. No variable was omitted by Stata, despite a high correlation of 0.85 for some pairs. Does this mean that Stata only omits variables with 100% correlation?

      In addition, do you have other suggestions regarding how to understand highly correlated data better?

      Many thanks for your help!



      • #4
        Your last question is a good one, but I can't give useful advice briefly. A correlation of 0.85 is not necessarily fatal, as you've discovered. The converse is that a world in which all predictors were uncorrelated would be a fairly weird world.

        You are writing almost as if it were Stata's job to choose a model for you, but Stata can't be a social scientist on your behalf.
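
        If you want a rough gauge of whether the correlation is actually doing damage, one common check (variable names here are made up) is to look at variance inflation factors after the regression:

            correlate accept_rate test_score
            regress rank accept_rate test_score
            estat vif                    // VIFs well above 10 are often read as a warning sign

        Any such cutoff is only a rule of thumb, but it tells you more than the pairwise correlations alone.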



        • #5
          One thing I plan to do is to use the z-scores of the variables for my school across years and see how much change in a particular variable is associated with change in the rankings. So far, I have analyzed the data by year instead of by a particular school across years.
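
          Roughly what I have in mind, with made-up variable names (standardize each ranking component within year, then look at year-to-year changes for my school):

              bysort year: egen mu = mean(test_score)
              bysort year: egen s = sd(test_score)
              generate z_test = (test_score - mu) / s      // within-year z-score
              xtset school_id year
              regress D.rank D.z_test if school_id == 1    // change on change for one school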

          I am not hoping that Stata will choose a model for me ^_^. I'm still learning Stata and sometimes I wonder if there are amazing things it can do that I'm not aware of. Thanks again for your help!



          • #6
            Correlated variables aren't necessarily a problem. If the correlated variables in question are simply in the model because they are nuisance variables whose effects on the outcome must be taken into account, then just throw them in as is and don't worry about them. (And don't try to interpret their regression coefficients or statistical significance separately.) If the correlation between them is high enough that the regression calculations become numerically unstable, Stata will drop one of them--which should be no cause for concern: you don't need and can't use the same information twice in the model.

            The correlation between two variables is a problem only if the focus of your research question is about how they separately influence the outcome. In that case you either need a very large sample, or, better, if feasible, a complex design that brings you data in which the correlation between them is greatly reduced or eliminated (e.g. matching on one of the variables).
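
            A toy illustration of that dropping behavior, using the auto dataset that ships with Stata:

                sysuse auto, clear
                generate weight2 = 2*weight         // perfectly collinear with weight
                regress price weight weight2 mpg    // Stata omits weight2 and notes the collinearity

            With merely high (not perfect) correlation nothing is dropped; the coefficients remain estimable, but their standard errors can be inflated.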



            • #7
              Thank you, Clyde! What you explained and suggested is very helpful. In this task, the research question is indeed how different (but highly correlated) ranking variables separately influence the ranking of a particular school. But since Stata didn't drop any variable, the correlation (ranging from 0.4 to 0.8) doesn't appear to be fatal.

