  • Test for multicollinearity - logistic regression

    Hi all,

    Thanks for the help in advance.
    I want to run a logistic regression. My dependent variable is binary (majority/minority). The model includes two independent variables: independent variable 1, which can take any value between 1 and 100, and independent variable 2, which can take any value between 0 and 7. I also add an interaction (moderation) term between the two independent variables. Finally, I include three control variables: two are categorical, with 3 categories for the first and 4 for the second, and the last can again take any value between 1 and 100.
    In Stata, I run the logistic regression, coding the categorical variables as factor variables (i.categorical1). I then obtain the VIF scores with estat vif, uncentered. I get extremely high VIFs (80 and 100) for independent variable 1 and for the control variable that ranges from 1 to 100. The correlation matrix shows no extreme correlations; the highest is 0.67.
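    For reference, the commands look roughly like this (variable names are placeholders: y is the binary outcome, x1 and x2 the independent variables, cat1/cat2 the categorical controls, ctrl the continuous control). Note that estat vif is only available after regress, so the VIFs come from refitting the same specification as a linear model:

    Code:
    * fit the logistic model, with factor-variable notation for the categoricals
    logistic y c.x1##c.x2 i.cat1 i.cat2 c.ctrl
    * estat vif runs only after regress, so refit linearly just for the VIFs
    quietly regress y c.x1##c.x2 i.cat1 i.cat2 c.ctrl
    estat vif, uncentered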

    How do I go about this? Am I doing it right? Am I calculating the VIF scores correctly? Should I handle the categorical variables differently? The two independent variables are needed for my research question. Should I then drop my control variable?
    In the models where independent variable 1 is not present, no high VIF scores (all under 3) were reported.

    Many thanks for the help already.

  • #2
    You are in a situation where you have the potential for a "multicollinearity" problem. That is, you have two rather strongly correlated right-hand-side variables (r = 0.67), one of which is a key independent variable. Remember that multicollinearity does not introduce bias into the results of a regression; it just reduces estimation precision. To determine whether you have a problem, just look at the standard error for independent variable 1. If it is too large (equivalently, if the confidence interval is so wide that you are unable to draw a conclusion about the effect of that variable that answers your research question), then you have a "multicollinearity" problem. But if the standard error is small enough (equivalently, the confidence interval is narrow enough that you can answer your research question about independent variable 1), then there is no problem and you can just move on.
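    In Stata terms, this check is nothing more than reading the coefficient table (again with placeholder variable names):

    Code:
    logit y c.x1##c.x2 i.cat1 i.cat2 c.ctrl
    * the output already reports the standard error and 95% CI for x1;
    * lincom re-displays that one row by itself if you prefer
    lincom x1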

    I have been putting "multicollinearity" in quotes because, as Arthur Goldberger points out in his textbook A Course in Econometrics, multicollinearity is just a misleading name for having too small a sample. He prefers the term "micronumerosity," which has the advantage of also pointing you toward the only way to solve such a problem: getting a bigger or better data sample. If you can get your hands on Goldberger's book, the chapter on this subject is enormously enjoyable. If you can't, you can get a quick, though not nearly as entertaining, summary of what he has to say at https://www.econlib.org/archives/200...ollineari.html.

    Removing the offending covariate (the "control variable") is definitely not a suitable solution to a multicollinearity problem; in fact, it's about the worst thing you can possibly do. That strong correlation implies that you really need to keep that variable in the model to avoid omitted variable bias, unless the covariate is independent of the outcome variable. (But if that were the case, there was no real reason to include it in the first place.)

    Notice that I have not mentioned the VIF here. The only thing it really tells you is which variables are involved in the multicollinearity. But since you can't do anything with those variables anyway, there isn't much value in knowing that. So don't waste any more time on it. Just look at the standard error or confidence interval of independent variable 1. If that is OK, there is nothing more to do: you have no problem, so move on. But if it shows you have an inconclusive analysis, then you must either get a larger sample (usually much larger, to an extent that may be difficult or entirely infeasible) or get data from a sampling design that breaks the multicollinearity (e.g., some kind of matched-pairs scheme, or oversampling observations with discordant values of the correlated variables). Tinkering with the model in the existing sample will not help, and might do considerable harm.



    • #3
      Hi Clyde,

      Thanks for the response. It really helps me out. I am currently in the process of writing my master's thesis, and this is definitely not my strongest area. It helps to gain clarity on what these numbers actually mean.

      The standard error for independent variable 1 is actually quite small: the coefficient has a z-statistic of -2.41.

      How would you recommend writing this up academically?

      Right now I have written: "To test for multicollinearity, Variance Inflation Factors (VIFs) were computed for each empirical model. The mean VIF for each model is reported in Table 3 in the Appendix. Based on the VIF scores, high multicollinearity is present in the models that include independent variable 1 (i.e., models 2, 4, and 5). Although this limitation is acknowledged, the nature of the research necessitates the inclusion of independent variable 1. Caution should be exercised when interpreting the individual effects of the correlated variables in these models."

      Originally posted by Clyde Schechter
      "If that is OK, there is nothing more to do: you have no problem, so move on. But if it shows you have an inconclusive analysis..."
      --> Is this something that can be mentioned in the limitations section of the paper?



      • #4
        Just a few additions to Clyde's helpful response. As Clyde said, the relevant information is in the standard error. And I'm with Clyde in wondering why you're "testing" for multicollinearity. There's no such thing. There are diagnostics that come with essentially arbitrary rules of thumb. There is no well-defined null hypothesis of "no multicollinearity" unless we insist our regressors are always uncorrelated -- and we simply can't insist on that or regression, logit, and so on would be nearly useless.

        Having said that, there are a couple of issues to be aware of. My suspicion is that the high VIFs are due to the interaction (moderating) term. If x1 and x2 are not centered before constructing the interaction, the coefficients on x1 and x2 can be useless. This happens in linear models and also, more subtly, in nonlinear models such as logit. In the linear case, we have

        y = b0 + b1*x1 + b2*x2 + b3*x1*x2 + u

        Then b1 is the partial effect of x1 on y when x2 = 0. Is this value of x2 interesting? Is it even possible? If not, the estimate of b1 is pretty meaningless. Moreover, the fact that b1 measures an impossible effect is often reflected in severe collinearity between x1 and x1*x2. It's telling you you're trying to estimate something you shouldn't be. Centering can "solve" the collinearity problem and give you a better parameter to interpret; the two issues are linked.
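        To see this, write the model in centered form, where m1 and m2 are the sample means of x1 and x2:

        y = a0 + a1*(x1 - m1) + a2*(x2 - m2) + a3*(x1 - m1)*(x2 - m2) + u

        Multiplying this out shows a3 = b3, so the interaction coefficient is unchanged. But a1 is now the partial effect of x1 when x2 = m2, an average (and certainly possible) value of x2, and the centered x1 - m1 is typically far less correlated with the product term.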

        I'm guessing you're happy with the effect of x1 and x2 on the log-odds. But you might be interested in the average marginal effect on the probability itself. Personally, I'd be more interested in the latter. You can have both.

        Code:
        * center both variables at their sample means before interacting
        sum x1
        gen x1_dm = x1 - r(mean)
        sum x2
        gen x2_dm = x2 - r(mean)
        * logit with centered main effects and their interaction
        logit y c.x1_dm c.x2_dm c.x1_dm#c.x2_dm
        * average marginal effects on the probability scale
        margins, dydx(x1_dm x2_dm)



        • #5
          Thanks for the help! Jeff Wooldridge

           Multicollinearity is something my thesis supervisor pointed out, hence I want to address it in some way. Are you familiar with any papers that discuss multicollinearity not being of great importance? Are there any prominent papers on this?

          Thanks in advance, and thanks again!



          • #6
             I'm not sure what else to say. You can find discussions in the econometrics book by Arthur Goldberger, as well as in my own introductory econometrics book (Chapter 3), of why fixating on multicollinearity is usually counterproductive. I did a quick Google Scholar search of the American Economic Review and the Quarterly Journal of Economics for "variance inflation factor." I couldn't find a single paper that calculates these diagnostics, for good reason: all the relevant uncertainty is captured by the reported standard errors. If they are small enough to give you a usable confidence interval, then that's what matters.

            Having said that, did you try centering your variables before interacting them? That should help a lot with the VIFs if you're concerned about them.

