  • Is multicollinearity between interaction terms a problem?

    Hi everyone,

    I am working on a regression with three key independent variables, call them a, b, and c, which are used to predict the dependent variable z.

    In the baseline model, the VIFs of these three variables are relatively low, ranging from 1 to 4.

    However, in the interaction model, when I add the interactions between a and b and between a and c, the VIFs of the interaction terms increase to over 10.

    By the way, the correlation between b and c is 0.7.

    Could this be a problem?

    Thank you for your kind attention!

    Best,
    Tianzhu

  • #2
    In my opinion, the VIF is one of the world's most over-rated statistics, and multicollinearity one of the world's most over-rated statistical issues.

    High correlation among interaction terms and main effects is normal and expected. Actually, it's inevitable.

    But more generally, you need to distinguish between the mere presence of multicollinearity, which is not a problem in its own right, and a multicollinearity problem. The first thing to remember is that no matter how great the multicollinearity among a set of variables, it in no way compromises the estimates associated with the other variables in the regression. So if you have multicollinearity among variables that are included solely to adjust for their effects but whose effects are not directly of interest, then you can ignore it altogether. Don't waste even a second thinking about it.

    The second thing to remember is that what multicollinearity does to the variables that are entangled in it is increase the standard errors of the estimated coefficients. It does not bias the coefficient estimates: it just decreases efficiency. So the way to think about multicollinearity involving a variable whose coefficient is actually important for your research goals is to look at the standard error (or, equivalently, the confidence interval). If the standard error is small enough (or, equivalently, if the confidence interval is narrow enough) that your research goals can be achieved, then you do not have a problem. This, of course, depends on your research goals and their context, and is not a statistical issue: it's a real-world issue. Think about it this way: is the confidence interval so wide that it would actually matter from a practical perspective which end of the confidence interval the true value of the coefficient lay close to? If not, you have no problem. If so, then you really do have a problem.

    Unfortunately, if you do have a problem, it is a problem that lacks a good solution. You can, of course, break the collinearity by removing some (often just one suffices) of the variables involved from the regression. But why was that variable (or those variables) included to begin with? Unless you are in the habit of including superfluous variables in your regressions, there is presumably a good reason why you needed that variable, and removing it may result in biasing your estimates of other variables (including the one of greatest interest) due to confounding (also known as omitted variable bias). So what other solutions to this problem are there? Nothing easy. You can always overcome multicollinearity (unless it is a perfectly linear relationship among the variables) by getting more data. But usually that requires a massive amount of additional data, an amount that is prohibitive in practical terms. There is another approach to breaking multicollinearity: throw out your data and start over with a different study design. For example, using matched pairs or stratified sampling can overcome multicollinearity. But a drastic step like that is obviously a last resort.
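
    For the setup described in #1, a minimal sketch of how one might act on this advice in Stata; the variable names z, a, b, and c follow the original post, and this is only an illustration, not the only way to do it:
    Code:
    * Interaction model, using factor-variable notation
    regress z c.a##c.b c.a##c.c

    * The VIFs of the interaction terms will typically look alarming here
    estat vif

    * What matters is whether the confidence intervals of the coefficients
    * you actually care about (shown in the -regress- output) are narrow
    * enough for your research goals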



    • #3
      Got it! Thank you very much for this detailed and insightful reply!



      • #4
        I agree with Clyde. The one thing I will add is that including interaction terms and squared terms can sometimes cause estimation problems. This can occur if, say, you are multiplying big numbers by big numbers and getting monstrous numbers as a result. It isn't a problem with the model so much as it is a problem of numerical precision. You may therefore wish to consider the following:
        • Rescale some variables, e.g. measure income in thousands of dollars instead of dollars.
        • Center continuous independent variables, e.g. subtract the mean from each case. Or, subtract some other value such that 0 becomes a more meaningful value. For example, in the US subtracting 12 from years of education would mean that 0 = high school graduate.
        Even apart from estimation issues, this can make coefficients easier to interpret. For example, a coefficient of 0.000000 that is highly significant is hard to make much sense of! The real value might be more like 0.000000472. By rescaling the variable, those digits will show up in the printed output.
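
        A minimal sketch of both adjustments in Stata; the variable names income, age, and educ are hypothetical placeholders:
        Code:
        * Rescale: measure income in thousands of dollars instead of dollars
        generate income_k = income/1000

        * Center a continuous predictor at its sample mean
        summarize age, meanonly
        generate age_c = age - r(mean)

        * Or shift so that 0 is a meaningful value (here, high school graduate)
        generate educ12 = educ - 12
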
        -------------------------------------------
        Richard Williams, Notre Dame Dept of Sociology
        StataNow Version: 19.5 MP (2 processor)

        EMAIL: [email protected]
        WWW: https://www3.nd.edu/~rwilliam



        • #5
          Thank you very much! These suggestions are really helpful!



          • #6
            Hi, I was wondering if someone could help me. Stata keeps omitting my variable because of collinearity. I am using difference-in-differences. My dependent variable is binary, so I am using a probit model: it is 1 if the (political) party was not re-elected and 0 if it was. My group variable is 0 if flooded (treatment group) and 1 if not flooded (control group). My time variable is 0 for 2001 and 1 for 2005, and I have a treatment effect, which is the interaction between these two terms. I would like to add another dummy variable, flood size (1 if large flood, 0 if small flood), and interact it with the treatment effect. But every time I add it, Stata omits the variables due to collinearity. Any advice on how to fix this? This variable is important for my model.



            • #7
              Grace alvarez
              Hi, have you found a solution to this problem? I am having the same issue.



              • #8
                Prof. Clyde, can running separate models for the important interaction terms be a viable solution in this situation? For instance, as shown in the attached image, the author has run three different models, each involving one of three risk variables (DOUBT, SDOUBT, and Provision).
                [Attached image: interaction term.PNG]

                By the way, what is the rationale behind running a separate model for each important variable while controlling for other variables? By doing this, doesn’t the model fail to account for the effects of the other risk variables? Ideally, in a real-world scenario, a firm would be exposed to all three risks.



                • #9
                  I can't discern from #8 what specific question is being asked or what problem is to be solved. The original topic in this thread referred to a situation with an interaction term. But #8 seems to revolve around three effects that "a firm would be exposed to" but says nothing about any interaction.

                  More generally, the topic of what variables to include in a regression is a complicated one and is, in general, far more a substantive question than a statistical one. A fuller explanation of what the variables in these models are and what their relationships to each other and to the outcome variable are would be necessary in order to comment sensibly on how one might use them in modeling. (Although necessary, it might not be sufficient for me because evidently this has something to do with finance or economics, which are not fields that I possess more than lay knowledge of.)



                  • #10
                    Given the high likelihood of multicollinearity in the interaction model, could we potentially address this issue by implementing separate models for each interaction, as some people have done? For example, the attached image illustrates the use of three distinct interaction models for three different risk variables for banks. In the model mentioned, the author is looking into how bank activity diversification (shown by I-HHI and A-HHI variables) affects the relationship between risk (measured by DOUBT, SDOUBT, and Provision) and profitability (ROA) of banks.

                    If this approach is acceptable, wouldn't we lose the explanatory power of the model? A model that only examines one risk at a time may not adequately account for the effects of the other two risks, as a bank typically faces all three risks simultaneously. What is the rationale behind running separate models in this manner? Is there an issue if all interaction terms are used in one model only?



                    • #11
                      If you run three separate models each containing one of these variables, then, to be sure, each model is ignoring the concurrent effects of the other two. And if these three variables are correlated with each other, as you seem to suggest they are, then the resulting coefficients will be subject to omitted variable bias.

                      If I'm understanding your question correctly, you think that a model containing all three of these measures will probably be fine, but that if you add interaction terms like (I_HHI A_HHI)##(DOUBT SDOUBT Provision) you will end up with severe enough multicollinearity from the interaction terms that your results will become inconclusive. That depends on your sample size and on how strong the collinearity among those terms turns out to be. There is really no good way to foresee it: trying to calculate how severe the variance inflation would be is harder than just running the model and seeing what kind of results you get. They might be fine, or you may end up with inconclusive results.

                      Running models that separately include DOUBT, SDOUBT, and Provision strikes me as wrong-headed unless these three variables are independent of each other. If they are not independent of each other, then removing any of them biases the results for the ones remaining.

                      But it might make sense to run several models, each of them including all three of those variables, but at different values of I_HHI and A_HHI. If you do that, you will at least get a sense of how the effects of DOUBT, SDOUBT, and Provision vary according to the values of I_HHI and A_HHI. But a formal test of interaction between the HHI's and the Doubt triad will not be possible, unless your regression command is one that is compatible with the -suest- command. -suest- provides a way of combining the results of separate regressions and carrying out cross-model contrasts of coefficients, but it only works with some regression commands. It does, for example, work with -regress-, but not with -xtreg-, nor with -reghdfe-.
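
                      For what it's worth, a minimal sketch of that -suest- workflow, assuming the models are fit with -regress-; the variable names follow the image in #8, while the subsample indicator hhi_group and the model names m_low and m_high are hypothetical placeholders:
                      Code:
                      * Same specification fit in two subsamples, e.g. low vs. high A_HHI
                      regress ROA DOUBT SDOUBT Provision if hhi_group == 0
                      estimates store m_low
                      regress ROA DOUBT SDOUBT Provision if hhi_group == 1
                      estimates store m_high

                      * Combine the stored results and test a cross-model contrast
                      suest m_low m_high
                      test [m_low_mean]DOUBT = [m_high_mean]DOUBT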



                      • #12
                        Thank you so much.

                        Running models that separately include DOUBT, SDOUBT, and Provision strikes me as wrong-headed unless these three variables are independent of each other. If they are not independent of each other, then removing any of them biases the results for the ones remaining.
                        So if they are independent, can we include them separately? Please correct me if I am wrong, but my understanding was that even when variables are independent of each other but relevant to explaining the outcome, excluding one can change the estimated coefficients of the remaining variables in the model.



                        • #13
                          Well, in theory, no. If we have x1 and x2 independent of each other but both correlated with y, and we run -regress y x1- and -regress y x1 x2-, then the coefficient of x1 will indeed be the same in both regressions. Here's a simple example:
                          Code:
                          * Example generated by -dataex-. For more info, type help dataex
                          clear
                          input float(x1 x2 y)
                          0 0   -.09732883
                          0 1     .3433237
                          1 0     .4103869
                          1 1     .9678739
                          0 0     .1596165
                          0 0   -.11359944
                          0 0   -.06579792
                          0 0   .035887778
                          0 0   -.13054179
                          0 0    .04477812
                          0 0    .05863484
                          0 0    .12823188
                          0 0    -.0982835
                          0 0    .01621547
                          0 0    .11930546
                          0 0    .05122598
                          0 0   -.02159614
                          0 0    .06590279
                          0 0    .00351517
                          0 0    .02952682
                          0 0     -.211529
                          0 0 .00012491283
                          0 0   -.09762138
                          0 0   .031608082
                          0 0    .07930533
                          0 0  -.064515755
                          0 0    .02307951
                          0 0   -.14832808
                          0 1    .55406016
                          0 1     .5383085
                          0 1     .4442332
                          0 1     .5667423
                          0 1     .4115534
                          0 1     .5024005
                          0 1     .3292511
                          0 1     .5532758
                          0 1    .50323164
                          0 1      .421922
                          0 1     .5599407
                          0 1     .3933133
                          0 1    .58352697
                          0 1    .52956873
                          0 1     .4361644
                          0 1      .540294
                          0 1     .4630217
                          0 1     .3226393
                          0 1     .5985385
                          0 1     .4270988
                          0 1     .5290598
                          0 1     .3418872
                          0 1     .4142972
                          0 1     .4377194
                          1 0     .4241637
                          1 0     .6019531
                          1 0    .29543564
                          1 0     .6395085
                          1 0    .35472515
                          1 0     .5333809
                          1 0    .26110044
                          1 0     .4396522
                          1 0     .5066778
                          1 0     .5144302
                          1 0     .5481147
                          1 0     .4575447
                          1 0     .6414205
                          1 0     .4236067
                          1 0     .4388893
                          1 0     .6253676
                          1 0     .4312668
                          1 0    .57299006
                          1 0     .4455705
                          1 0     .4463347
                          1 0     .3159328
                          1 0    .38698184
                          1 0     .6019938
                          1 0     .3039691
                          1 1     .9046689
                          1 1     .8086829
                          1 1     .9228743
                          1 1     .8297083
                          1 1    1.0043186
                          1 1    1.0856181
                          1 1     1.134725
                          1 1     .9955599
                          1 1    1.0247834
                          1 1    1.0099474
                          1 1     .8691434
                          1 1     .9274579
                          1 1    1.0483361
                          1 1     .9330862
                          1 1     .9543166
                          1 1     .9585025
                          1 1    1.0114597
                          1 1    1.0017244
                          1 1     .9051694
                          1 1     .9418585
                          1 1     .9418877
                          1 1    1.0281725
                          1 1    1.0327027
                          1 1    1.1342286
                          end
                          
                          corr x1 x2 y
                          regress y x1
                          regress y x2
                          regress y x1 x2
                          Notice that the coefficients of x1 and x2 are the same in each of the regressions they appear in. The standard errors differ, and in this case the model with both variables actually has smaller standard errors. That is because x1 and x2 are both also correlated with y. But if y were correlated with x1, but not x2, then the standard error of the coefficient of x1 would go up in the model that contains both variables. You can experiment with the code above by creating new versions of y that have different patterns of correlation with x1 and x2.
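
                          For instance, a hypothetical illustration of that experiment (the coefficient and noise scale here are arbitrary):
                          Code:
                          * Make a new outcome that depends on x1 only, then refit both models
                          set seed 12345
                          generate y2 = 0.5*x1 + rnormal(0, 0.1)
                          regress y2 x1
                          regress y2 x1 x2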


                          Now, in reality, it is rare to come up with two independent variables that are correlated exactly 0. When you do, they are usually both indicator variables in a completely balanced pattern, just like the example shown above. But if the correlation between any two variables x1 and x2 is very close to zero, then the coefficients of x1 in the two regressions will be almost exactly equal.



                          • #14
                            Thank you, Prof. Clyde

                            I have one question related to the appropriateness of model construction.

                            I’ve got a 24-year panel dataset and am looking to understand the impact of two major events on firm performance (ROA). The events are the Financial Crisis of 2007 and the COVID-19 pandemic.

                            Here’s what I’ve done so far:
                            1. Created dummy variables for the crisis periods:
                              • Post-GFC for after the Financial Crisis
                              • COVID for the pandemic period
                            2. Ran separate regressions for each crisis dummy.
                            3. Added several interaction terms between the crisis dummies and other variables.
                            I’m unsure if this is the right approach. Is running separate regressions and using interactions the best way to model the effects of these crises?

                            I’ve attached the results from my two models.


                            Code:
                             
                            GFC regression                                    COVID regression
                            Irisk                 0.0004542    0.0011902      Irisk                0.000807    0.000156
                            CAR                  -0.0005235   -0.0014905      CAR                 -0.00103    -0.00042
                            Liq                   0.2286326    0.3114155      Liq                  0.303074    0.342128
                            LerAdv                0.0209964    0.0811213      LerAdv               0.008476    0.00776
                            PostGFC              -0.0551871    0.0966423      COVID               -0.06048    -0.05621
                            c.Irisk#c.PostGFC                 -0.0016471      c.Irisk#c.COVID                  0.001674
                            c.CAR#c.PostGFC                    0.0021279      c.CAR#c.COVID                   -0.0001
                            c.Liq#c.PostGFC                   -0.2070424      c.Liq#c.COVID                   -0.49976
                            c.LerAdv#c.PostGFC                -0.0688134      c.LerAdv#c.COVID                 0.097662
                            Your insights on this would be greatly appreciated and would be a tremendous help to me.

                            Thank you



                            • #15
                              I'm reluctant to comment on this. First, I would never endorse any model of anything as "the best way" to model something. The real world is much too complicated for any model that is small enough to actually work with and understand to be "the best way" to model it. There are always omitted variables that could improve the model if they were included, if suitable data were available and adequate computing resources accessible, etc. The question is whether the model is good enough for whatever application you will apply it to. And that is more of a substantive question than a statistical one. In my own field I might offer advice on substantive questions, but I can't do that in a field like finance that I know next to nothing about.

                              That said, I probably would not do it the way you show. I don't know how you have implemented your PostGFC variable: at what point do you consider the crisis to have ended? (You also say it starts in 2007, although most people, I believe, say it began in late 2008. Why?) If PostGFC really means any year after the start of the GFC, then the Covid era is also part of the PostGFC era. So any Covid effect would best be captured as something overlaid on the PostGFC effect that was still operative in the same years. That really calls for including the PostGFC and Covid variables (and their interactions with the other variables) in the model together.

                              Even if you defined PostGFC so that it returned to 0 before the Covid era started, using the separate model for Covid would only be appropriate if it is reasonable to believe that the real-world effects of the other variables returned to their pre-GFC values once the PostGFC era ended. Is that a reasonable assumption? I have no idea--but you should. If you are unsure, it would be safer to assume that they do not return to their pre-GFC values and allow the data to speak to this issue themselves. That could be done by eliminating the PostGFC variable and replacing it with a three-level variable: 0 for pre-GFC, 1 for during GFC, and 2 for after GFC ends. I would then use that, and the Covid variable, and the interactions with the other variables in a single model. (Well--it depends on when you consider the GFC to have ended. If the post-GFC period is actually the same as the COVID period, or nearly so, then this approach will make the post-GFC level of the GFC variable colinear with the Covid variable, or nearly so, which will be problematic for interpretation.) But if you consider the GFC to have ended in, say, 2015, and the Covid era doesn't begin until 2020 (or maybe 2019 if your data extends to the parts of the world, mostly China, where Covid activity was recognized in 2019), then I think the two would be separate enough to avoid this problem.

                              One technical issue. Your PostGFC and COVID variables should not be prefixed with c. They are discrete variables, and need to be prefixed with i. in the regressions. For the results you have so far, this will make no difference. But if you use the -margins- command to assist in interpreting your results, the c. prefix will cause -margins- to mishandle them. And things will be far worse than that if you end up going to a three-level variable: c.-prefixing that one will get the regression results very wrong.
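
                              To make that concrete, a hedged sketch along these lines; the cutoff years, the year variable, and the name gfc_period are assumptions (use whatever dates you settle on), and the other names follow #14:
                              Code:
                              * Three-level GFC variable: 0 = pre-GFC, 1 = during GFC, 2 = after the GFC ends
                              generate byte gfc_period = 0
                              replace gfc_period = 1 if inrange(year, 2008, 2012)
                              replace gfc_period = 2 if year > 2012
                              generate byte covid = (year >= 2020)

                              * One model, with i. prefixes on the discrete variables so -margins- handles them correctly
                              regress ROA (c.Irisk c.CAR c.Liq c.LerAdv)##(i.gfc_period i.covid)
                              margins gfc_period, dydx(Irisk)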

