
  • How to test for multicollinearity between categorical variables

    Hi all,

    I have a number of categorical variables in my regression model, including income, employment status, and education, which could be correlated with each other. Am I correct in assuming that ANOVA can test for multicollinearity, or is there a better way to test for multicollinearity between categorical variables?

    Thank you,
    Shruti

  • #2
    Try the VIF: variance inflation factor

    • #3
      Also, simply compute a correlation matrix and check whether any two variables have a correlation coefficient greater than 0.8.

      • #4
        Shruti: Why are you "testing" for multicollinearity in the first place? The VIFs that you get are what they are. How is the precision of your estimated regression coefficients?

        • #5
          When our model specification is guided by a procrustean theory, I agree with Jeff Wooldridge that the VIF “is what it is”. However, sometimes diagnosing multicollinearity suggests an alternative specification which is equally justifiable on theoretical grounds.

          We know that multicollinearity inflates the estimated variance of an estimated regression coefficient, thus shrinking its t-statistic and making it harder to reject the null hypothesis that the coefficient is zero. The VIF statistic, as estimated by Stata’s command -estat vif- issued after estimation with -regress- or -anova-, is useful for learning how much multicollinearity inflates the variance estimate of each continuous regressor.
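
As a back-of-the-envelope illustration of that shrinkage (plain Python, not Stata; the orthogonal-case standard error of 0.10 is made up, while 23.47 is the largest VIF in the output below): the standard error of a coefficient scales with the square root of its VIF, so its t-statistic shrinks by the same factor.

```python
# The SE of a coefficient grows by sqrt(VIF) relative to the orthogonal
# case, so the t-statistic shrinks by 1/sqrt(VIF).
import math

# Hypothetical SE the coefficient would have with no collinearity
# (made-up number, for illustration only).
se_orthogonal = 0.10

for vif in (1.0, 4.0, 23.47):  # 23.47 is the largest VIF in the output below
    se = se_orthogonal * math.sqrt(vif)  # SE inflated by sqrt(VIF)
    shrink = 1 / math.sqrt(vif)          # t-statistic scaled by this factor
    print(f"VIF={vif:6.2f}  SE={se:.3f}  t-stat scaled to {shrink:.0%}")
```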

          However, for a categorical regressor with more than two categories (like income, employment status, and education in Shruti Gorsia's problem), the VIF statistic is less useful.

          Stata’s downloadable data file nlsy80.dta has education variables for individuals as well as for both the mother and the father. If we treat these education variables as categorical, and the individual’s age as continuous, we might ask how well the education categorical variables explain the individual’s wage, after controlling for age.

          First, suppose we only include the individual’s own education and look at the VIF estimates.

          Code:
          webuse nlsy80, clear
          anova  wage c.age  i.educ
          reg wage c.age  i.educ
          estat vif
          Results:
          Code:
           
          . anova  wage c.age  i.educ
           
                                   Number of obs =        935    R-squared     =  0.1371
                                   Root MSE      =    377.639    Adj R-squared =  0.1278
           
                            Source | Partial SS         df         MS        F    Prob>F
                        -----------+----------------------------------------------------
                             Model |   20943640         10     2094364     14.69  0.0000
                                   |
                               age |  4278188.5          1   4278188.5     30.00  0.0000
                              educ |   17193627          9     1910403     13.40  0.0000
                                   |
                          Residual |  1.318e+08        924   142610.96 
                        -----------+----------------------------------------------------
                             Total |  1.527e+08        934   163507.67 
           
          . reg
           
                Source |       SS           df       MS      Number of obs   =       935
          -------------+----------------------------------   F(10, 924)      =     14.69
                 Model |  20943639.7        10  2094363.97   Prob > F        =    0.0000
              Residual |   131772528       924  142610.962   R-squared       =    0.1371
          -------------+----------------------------------   Adj R-squared   =    0.1278
                 Total |   152716168       934  163507.675   Root MSE        =    377.64
           
          ------------------------------------------------------------------------------
                  wage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
          -------------+----------------------------------------------------------------
                   age |   22.29937   4.071355     5.48   0.000     14.30919    30.28954
                       |
                  educ |
                   10  |  -30.60354   135.4815    -0.23   0.821    -296.4907    235.2836
                   11  |   84.23794   132.7789     0.63   0.526    -176.3452    344.8211
                   12  |   133.1748    121.206     1.10   0.272    -104.6962    371.0457
                   13  |    235.317   126.7803     1.86   0.064    -13.49362    484.1277
                   14  |   281.2789   127.5009     2.21   0.028     31.05392    531.5039
                   15  |   378.1296   132.3847     2.86   0.004       118.32    637.9391
                   16  |   395.6623   123.8422     3.19   0.001     152.6177    638.7069
                   17  |    451.714   133.6645     3.38   0.001     189.3928    714.0352
                   18  |   437.2439    129.487     3.38   0.001     183.1213    691.3666
                       |
                 _cons |  -10.63771   186.5458    -0.06   0.955    -376.7403    355.4649
          ------------------------------------------------------------------------------
           
          . estat vif
           
              Variable |       VIF       1/VIF 
          -------------+----------------------
                   age |      1.05    0.953720
                  educ |
                   10  |      4.34    0.230618
                   11  |      5.07    0.197185
                   12  |     23.47    0.042611
                   13  |      8.71    0.114822
                   14  |      8.05    0.124154
                   15  |      5.26    0.189971
                   16  |     13.54    0.073836
                   17  |      4.80    0.208473
                   18  |      6.29    0.158907
          -------------+----------------------
              Mean VIF |      8.06
          The ANOVA output tells us that the categorical variable -i.educ- is highly statistically significant: the F-statistic testing the hypothesis that this categorical variable has no effect is 13.40, and its p-value rounds to 0.0000.

          The VIF estimate for the continuous variable, -age-, is only slightly greater than 1.0, which tells us that the R^2 from regressing -age- on the categorical variable -i.educ- is very small. Thus, the estimated variance of the coefficient of -age- is hardly inflated at all, which is good news for hypothesis tests about the effect of -age- using these data.

          However, the VIF statistics for the individual values of the categorical variable -i.educ- appear alarmingly high. Almost all of them are greater than 5, and one exceeds 23. Should we be concerned that the categorical variable is so highly multicollinear that we should consider excluding it from the regression?

          The answer is that we should NOT be concerned about the high VIF values for the dummies which together capture the categorical variable -i.educ-. The high values reflect the fact that the dummies used to represent a categorical variable are necessarily correlated with one another. These high values show that the VIF is not useful for a categorical variable with more than two values and should be ignored.

          The limitations of the VIF for understanding multicollinearity among two or more categorical variables are even more apparent. Suppose we want to estimate the impact of mother’s and father’s education after controlling for the individual’s education and age.

          Compare the following three anova estimates:

          Code:
          webuse nlsy80, clear
          anova  wage c.age  i.educ i.meduc i.feduc
          estat vif
          anova  wage c.age  i.educ i.meduc
          anova  wage c.age  i.educ i.feduc
          Results:
          Code:
          . anova  wage c.age  i.educ i.meduc i.feduc
           
                                   Number of obs =        722    R-squared     =  0.1989
                                   Root MSE      =    377.218    Adj R-squared =  0.1456
           
                            Source | Partial SS         df         MS        F    Prob>F
                        -----------+----------------------------------------------------
                             Model |   23884360         45   530763.56      3.73  0.0000
                                   |
                               age |  3531288.8          1   3531288.8     24.82  0.0000
                              educ |  5901530.2          9   655725.58      4.61  0.0000
                             meduc |  2981407.2         18   165633.73      1.16  0.2856
                             feduc |  3197679.7         17    188098.8      1.32  0.1716
                                   |
                          Residual |   96190434        676   142293.54 
                        -----------+----------------------------------------------------
                             Total |  1.201e+08        721   166539.24 
           
          . estat vif
          
          <Useless output of -estat vif- omitted>
          .
          . anova  wage c.age  i.educ i.meduc
           
                                   Number of obs =        857    R-squared     =  0.1736
                                   Root MSE      =    376.414    Adj R-squared =  0.1457
           
                            Source | Partial SS         df         MS        F    Prob>F
                        -----------+----------------------------------------------------
                             Model |   24648702         28    880310.8      6.21  0.0000
                                   |
                               age |  3850366.6          1   3850366.6     27.18  0.0000
                              educ |   10256519          9   1139613.2      8.04  0.0000
                             meduc |  4862267.1         18   270125.95      1.91  0.0128
                                   |
                          Residual |  1.173e+08        828   141687.26 
                        -----------+----------------------------------------------------
                             Total |  1.420e+08        856   165847.85 
           
          .
          . anova  wage c.age  i.educ i.feduc
           
                                   Number of obs =        741    R-squared     =  0.1752
                                   Root MSE      =    375.544    Adj R-squared =  0.1440
           
                            Source | Partial SS         df         MS        F    Prob>F
                        -----------+----------------------------------------------------
                             Model |   21359347         27   791086.93      5.61  0.0000
                                   |
                               age |  4147714.3          1   4147714.3     29.41  0.0000
                              educ |  6643255.1          9   738139.45      5.23  0.0000
                             feduc |  4853057.4         17   285473.97      2.02  0.0085
                                   |
                          Residual |  1.006e+08        713   141033.65 
                        -----------+----------------------------------------------------
                             Total |  1.219e+08        740   164751.81

          When both parents’ education levels are included among the determinants, as in the first of the three anova estimates above, the p-values on those categorical variables are both well above 0.1, suggesting that, after controlling for an individual’s age and his or her own education, neither parent’s education contributes to explaining wage. Could this be because the two categorical variables, -i.meduc- and -i.feduc-, are too highly correlated with one another (as the theory of assortative mating would predict)? The -estat vif- command is completely useless for answering this question.

          To diagnose multicollinearity in this case, we omit the categorical variable representing the education of one of the two parents, as in the second and third of the above anova estimates. We then see that the categorical variable representing the other parent’s education is indeed statistically significant, at the 5% level for mother’s education (p = 0.0128) and at the 1% level for father’s education (p = 0.0085). This demonstrates that multicollinearity between the two categorical variables -i.meduc- and -i.feduc- is “inflating the variances” of their coefficient estimates and thus shrinking the F-statistic that tests whether either categorical variable is a statistically significant contributor to explaining wage variance.

          As is often the case when two variables are multicollinear, the relevant test can be whether the two are jointly significant. In this case, that test can be performed with Stata’s command -test- like this:

          Code:
          . webuse nlsy80, clear
          
          . anova  wage c.age  i.educ i.meduc i.feduc
          
          . test i.meduc i.feduc
           
                            Source | Partial SS         df         MS        F    Prob>F
                       ------------+----------------------------------------------------
                       meduc feduc |  7590003.4         35   216857.24      1.52  0.0285
                          Residual |   96190434        676   142293.54
          Thus, although neither of the categorical variables is statistically significant in the full equation, the two are jointly so, with a p-value of 0.0285.
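
The arithmetic behind that joint F-statistic can be reproduced directly from the numbers in the -test- output above (a plain-Python check, not Stata): the F-statistic is the mean square of the tested terms divided by the residual mean square.

```python
# Reproduce the joint F-statistic from the sums of squares and degrees of
# freedom reported in the -test- output above.
ss_joint, df_joint = 7590003.4, 35   # meduc feduc partial SS and df
ss_resid, df_resid = 96190434, 676   # residual SS and df

f_stat = (ss_joint / df_joint) / (ss_resid / df_resid)
print(round(f_stat, 2))  # matches the F = 1.52 reported by Stata
```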

          All this to say that, in order to diagnose multicollinearity between a categorical variable with more than two values and other categorical or continuous variables, it would be useful to have a generalized version of the VIF.
