I estimated an OLS regression to examine the impact of a unique independent variable X on Y (interest rates). I control for loan-specific characteristics, firm-specific characteristics and macroeconomic variables, and include two interaction terms between control variables. The correlation matrix (excluding the two factor control variables) shows no strong correlation between any of the variables, the regression output reports all variables as significant, and a follow-up linktest indicates that the model is correctly specified. However, when I checked for multicollinearity with estat vif, the VIFs are high for some control variables. I understand that it is possible to have high VIFs despite low correlation coefficients, but how do I figure out what needs to be corrected to tackle a high VIF when the correlation coefficients are low? Below I have reported my estat vif results:

Apologies, I have erased the name of some variables due to confidentiality reasons.
In the table above, loan purpose and region are factor variables, which is why they were excluded from the correlation matrix. Variables prefixed with "c" have been centered; for example, cloanamount is the centered loan amount. I understand some studies recommend 5 as the VIF cutoff, but I am using 10 (Vittinghoff E, Glidden DV, Shiboski SC, McCulloch CE, 2012), mainly because my sample is very large (900k+ observations); I have read that the VIF threshold can be relaxed for large samples [Source]. I also ignore the large VIFs caused by factor variables and interaction terms, as suggested in an article by Dr. Paul Allison (quoted below).
But I still have high VIFs for bond yield and the loan portfolio variable, and centering them does not seem to alleviate the issue much. Can anyone advise how to reduce these VIFs, or is it acceptable to leave them as they are? The relevant passages from Allison's article:
2. The high VIFs are caused by the inclusion of powers or products of other variables. If you specify a regression model with both x and x², there’s a good chance that those two variables will be highly correlated. Similarly, if your model has x, z, and xz, both x and z are likely to be highly correlated with their product. This is not something to be concerned about, however, because the p-value for xz is not affected by the multicollinearity. This is easily demonstrated: you can greatly reduce the correlations by “centering” the variables (i.e., subtracting their means) before creating the powers or the products. But the p-value for x² or for xz will be exactly the same, regardless of whether or not you center. And all the results for the other variables (including the R² but not including the lower-order terms) will be the same in either case. So the multicollinearity has no adverse consequences.
3. The variables with high VIFs are indicator (dummy) variables that represent a categorical variable with three or more categories. If the proportion of cases in the reference category is small, the indicator variables will necessarily have high VIFs, even if the categorical variable is not associated with other variables in the regression model.
Suppose, for example, that a marital status variable has three categories: currently married, never married, and formerly married. You choose formerly married as the reference category, with indicator variables for the other two. What happens is that the correlation between those two indicators gets more negative as the fraction of people in the reference category gets smaller. For example, if 45 percent of people are never married, 45 percent are married, and 10 percent are formerly married, the VIFs for the married and never-married indicators will be at least 3.0.
Is this a problem? Well, it does mean that p-values for the indicator variables may be high. But the overall test that all indicators have coefficients of zero is unaffected by the high VIFs. And nothing else in the regression is affected. If you really want to avoid the high VIFs, just choose a reference category with a larger fraction of the cases. That may be desirable in order to avoid situations where none of the individual indicators is statistically significant even though the overall set of indicators is significant.
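Coming back to my original puzzle (high VIFs despite low pairwise correlations): the mechanism is easy to reproduce, because a variable can be close to a linear combination of several others while correlating only modestly with each one individually. A toy sketch in Python rather than Stata, with made-up variables, so purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# six mutually (near-)uncorrelated predictors
x = rng.standard_normal((n, 6))
# a seventh predictor that is almost the sum of the other six
s = x.sum(axis=1) + 0.5 * rng.standard_normal(n)
X = np.column_stack([x, s])

# the largest pairwise correlation is only about 0.4 ...
corr = np.corrcoef(X, rowvar=False)
max_offdiag = np.abs(corr - np.eye(X.shape[1])).max()

def vif(X, j):
    # regress column j on all the other columns (plus an intercept)
    y = X[:, j]
    Z = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r2 = 1.0 - ((y - Z @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
print(f"max |pairwise corr| = {max_offdiag:.2f}")      # around 0.4
print(f"VIF of the sum-like column = {vifs[-1]:.1f}")  # well above 10
```

So the pairwise correlation matrix can look harmless even when one column is nearly spanned by the rest, which is exactly the situation the VIF (an R² from a multiple regression, not a pairwise statistic) is built to detect.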
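His point about indicator variables is just as easy to verify: with a 45/45/10 split and the 10% group as reference, the two indicators are strongly negatively correlated and their VIFs land around 3, while picking a larger reference category brings the VIF back toward 1. A quick Python check with hypothetical categories:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

def indicator_vif(p_ref):
    # three categories; the reference category has probability p_ref,
    # the two non-reference categories split the remainder equally
    p_other = (1 - p_ref) / 2
    status = rng.choice(3, size=n, p=[p_other, p_other, p_ref])  # 2 = reference
    d1 = (status == 0).astype(float)
    d2 = (status == 1).astype(float)
    # with only two indicators, VIF = 1 / (1 - r^2) exactly
    r = np.corrcoef(d1, d2)[0, 1]
    return 1.0 / (1.0 - r**2)

vif_small_ref = indicator_vif(0.10)  # 45/45/10 split -> VIF about 3
vif_large_ref = indicator_vif(0.45)  # 27.5/27.5/45 split -> VIF near 1.2
print(f"VIF with 10% reference category: {vif_small_ref:.2f}")
print(f"VIF with 45% reference category: {vif_large_ref:.2f}")
```

So for my region and loan purpose factor variables, swapping in a more populous reference category would shrink the reported VIFs without changing the joint test on the category set.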