
  • How to test for multicollinearity between categorical variables

    Hi all,

    I have a number of categorical variables in my regression model, including income, employment status, and education, which could be correlated with each other. Am I correct in assuming that ANOVA can test for multicollinearity, or is there a better way to test for multicollinearity between categorical variables?

    Thank you,
    Shruti

  • #2
    Try the VIF: variance inflation factor

    • #3
      Also, simply compute a correlation matrix and check whether any two variables have a correlation coefficient greater than 0.8.

      • #4
        Shruti: Why are you "testing" for multicollinearity in the first place? The VIFs that you get are what they are. How is the precision of your estimated regression coefficients?

        • #5
          When our model specification is guided by a procrustean theory, I agree with Jeff Wooldridge that the VIF “is what it is”. However, sometimes diagnosing multicollinearity suggests an alternative specification which is equally justifiable on theoretical grounds.

          We know that multicollinearity inflates the estimated variance of an estimated regression coefficient, thus shrinking its t-statistic and making it harder to reject the null hypothesis that the coefficient is zero. The VIF statistic, as estimated by Stata’s command -estat vif- issued after estimation with -regress- or -anova-, is useful for learning how much multicollinearity inflates the variance estimate of each continuous regressor.
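
As a back-of-the-envelope illustration of that shrinkage (plain Python, not Stata; the orthogonal-case standard error of 0.10 is made up, while 23.47 is the largest VIF in the output below): the standard error of a coefficient scales with the square root of its VIF, so its t-statistic shrinks by the same factor.

```python
# The SE of a coefficient grows by sqrt(VIF) relative to the orthogonal
# case, so the t-statistic shrinks by 1/sqrt(VIF).
import math

# Hypothetical SE the coefficient would have with no collinearity
# (made-up number, for illustration only).
se_orthogonal = 0.10

for vif in (1.0, 4.0, 23.47):  # 23.47 is the largest VIF in the output below
    se = se_orthogonal * math.sqrt(vif)  # SE inflated by sqrt(VIF)
    shrink = 1 / math.sqrt(vif)          # t-statistic scaled by this factor
    print(f"VIF={vif:6.2f}  SE={se:.3f}  t-stat scaled to {shrink:.0%}")
```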

          However, for a categorical regressor with more than two categories (like income, employment status, and education in Shruti Gorsia's problem), the VIF statistic is less useful.

          Stata’s downloadable data file nlsy80.dta has education variables for individuals as well as for both the mother and the father. If we treat these education variables as categorical, and the individual’s age as continuous, we might ask how well the education categorical variables explain the individual’s wage, after controlling for age.

          First, suppose we only include the individual’s own education and look at the VIF estimates.

          Code:
          webuse nlsy80, clear
          anova  wage c.age  i.educ
          reg wage c.age  i.educ
          estat vif
          Results:
          Code:
           
          . anova  wage c.age  i.educ
           
                                   Number of obs =        935    R-squared     =  0.1371
                                   Root MSE      =    377.639    Adj R-squared =  0.1278
           
                            Source | Partial SS         df         MS        F    Prob>F
                        -----------+----------------------------------------------------
                             Model |   20943640         10     2094364     14.69  0.0000
                                   |
                               age |  4278188.5          1   4278188.5     30.00  0.0000
                              educ |   17193627          9     1910403     13.40  0.0000
                                   |
                          Residual |  1.318e+08        924   142610.96 
                        -----------+----------------------------------------------------
                             Total |  1.527e+08        934   163507.67 
           
          . reg
           
                Source |       SS           df       MS      Number of obs   =       935
          -------------+----------------------------------   F(10, 924)      =     14.69
                 Model |  20943639.7        10  2094363.97   Prob > F        =    0.0000
              Residual |   131772528       924  142610.962   R-squared       =    0.1371
          -------------+----------------------------------   Adj R-squared   =    0.1278
                 Total |   152716168       934  163507.675   Root MSE        =    377.64
           
          ------------------------------------------------------------------------------
                  wage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
          -------------+----------------------------------------------------------------
                   age |   22.29937   4.071355     5.48   0.000     14.30919    30.28954
                       |
                  educ |
                   10  |  -30.60354   135.4815    -0.23   0.821    -296.4907    235.2836
                   11  |   84.23794   132.7789     0.63   0.526    -176.3452    344.8211
                   12  |   133.1748    121.206     1.10   0.272    -104.6962    371.0457
                   13  |    235.317   126.7803     1.86   0.064    -13.49362    484.1277
                   14  |   281.2789   127.5009     2.21   0.028     31.05392    531.5039
                   15  |   378.1296   132.3847     2.86   0.004       118.32    637.9391
                   16  |   395.6623   123.8422     3.19   0.001     152.6177    638.7069
                   17  |    451.714   133.6645     3.38   0.001     189.3928    714.0352
                   18  |   437.2439    129.487     3.38   0.001     183.1213    691.3666
                       |
                 _cons |  -10.63771   186.5458    -0.06   0.955    -376.7403    355.4649
          ------------------------------------------------------------------------------
           
          . estat vif
           
              Variable |       VIF       1/VIF 
          -------------+----------------------
                   age |      1.05    0.953720
                  educ |
                   10  |      4.34    0.230618
                   11  |      5.07    0.197185
                   12  |     23.47    0.042611
                   13  |      8.71    0.114822
                   14  |      8.05    0.124154
                   15  |      5.26    0.189971
                   16  |     13.54    0.073836
                   17  |      4.80    0.208473
                   18  |      6.29    0.158907
          -------------+----------------------
              Mean VIF |      8.06
          The ANOVA output tells us that the categorical variable -i.educ- is highly statistically significant: the F-statistic testing the hypothesis that this categorical variable has no effect is 13.40, and its p-value rounds to 0.0000.

          The VIF estimate for the continuous variable, -age-, is only slightly greater than 1.0, which tells us that the R^2 from regressing -age- on the categorical variable -i.educ- is very small. Thus, the estimated variance of the coefficient of -age- is hardly inflated at all, which is good news for hypothesis tests about the effect of -age- using these data.

          However, the VIF statistics for the individual values of the categorical variable -i.educ- appear alarmingly high. Almost all of them are greater than 5, and one exceeds 23. Should we be concerned that the categorical variable is so highly multicollinear that we should consider excluding it from the regression?

          The answer is that we should NOT be concerned about the high VIF values for the dummies which together capture the categorical variable -i.educ-. The high values reflect the fact that the dummies used to represent a categorical variable are necessarily correlated with one another. These high values show that the VIF is not useful for a categorical variable with more than two values and should be ignored.

          The limitations of the VIF for understanding multicollinearity among two or more categorical variables are even more apparent. Suppose we want to estimate the impact of mother’s and father’s education after controlling for the individual’s education and age.

          Compare the following three anova estimates:

          Code:
          webuse nlsy80, clear
          anova  wage c.age  i.educ i.meduc i.feduc
          estat vif
          anova  wage c.age  i.educ i.meduc
          anova  wage c.age  i.educ i.feduc
          Results:
          Code:
          . anova  wage c.age  i.educ i.meduc i.feduc
           
                                   Number of obs =        722    R-squared     =  0.1989
                                   Root MSE      =    377.218    Adj R-squared =  0.1456
           
                            Source | Partial SS         df         MS        F    Prob>F
                        -----------+----------------------------------------------------
                             Model |   23884360         45   530763.56      3.73  0.0000
                                   |
                               age |  3531288.8          1   3531288.8     24.82  0.0000
                              educ |  5901530.2          9   655725.58      4.61  0.0000
                             meduc |  2981407.2         18   165633.73      1.16  0.2856
                             feduc |  3197679.7         17    188098.8      1.32  0.1716
                                   |
                          Residual |   96190434        676   142293.54 
                        -----------+----------------------------------------------------
                             Total |  1.201e+08        721   166539.24 
           
          . estat vif
          
          <Useless output of -estat vif- omitted>
          .
          . anova  wage c.age  i.educ i.meduc
           
                                   Number of obs =        857    R-squared     =  0.1736
                                   Root MSE      =    376.414    Adj R-squared =  0.1457
           
                            Source | Partial SS         df         MS        F    Prob>F
                        -----------+----------------------------------------------------
                             Model |   24648702         28    880310.8      6.21  0.0000
                                   |
                               age |  3850366.6          1   3850366.6     27.18  0.0000
                              educ |   10256519          9   1139613.2      8.04  0.0000
                             meduc |  4862267.1         18   270125.95      1.91  0.0128
                                   |
                          Residual |  1.173e+08        828   141687.26 
                        -----------+----------------------------------------------------
                             Total |  1.420e+08        856   165847.85 
           
          .
          . anova  wage c.age  i.educ i.feduc
           
                                   Number of obs =        741    R-squared     =  0.1752
                                   Root MSE      =    375.544    Adj R-squared =  0.1440
           
                            Source | Partial SS         df         MS        F    Prob>F
                        -----------+----------------------------------------------------
                             Model |   21359347         27   791086.93      5.61  0.0000
                                   |
                               age |  4147714.3          1   4147714.3     29.41  0.0000
                              educ |  6643255.1          9   738139.45      5.23  0.0000
                             feduc |  4853057.4         17   285473.97      2.02  0.0085
                                   |
                          Residual |  1.006e+08        713   141033.65 
                        -----------+----------------------------------------------------
                             Total |  1.219e+08        740   164751.81

          When both parents’ education levels are included among the determinants, as in the first of the three anova estimates above, the p-values on those categorical variables are both well above 0.1, suggesting that, after controlling for an individual’s age and his or her own education, neither parent’s education contributes to explaining wage. Could this be because the two categorical variables, -i.meduc- and -i.feduc-, are too highly correlated with one another (as the theory of assortative mating would predict)? The -estat vif- command is completely useless for answering this question.

          To diagnose multicollinearity in this case, we omit the categorical variable representing the education of one of the two parents, as in the second and third of the above anova estimates. We then see that the categorical variable representing the other parent’s education is indeed statistically significant, at the 5% level for mother’s education (p = 0.0128) and at the 1% level for father’s education (p = 0.0085). This demonstrates that multicollinearity between the two categorical variables -i.meduc- and -i.feduc- is “inflating the variances” of their coefficient estimates and thus shrinking the F-statistic that tests whether either categorical variable is a statistically significant contributor to explaining wage variance.

          As is often the case when two variables are multicollinear, the relevant test can be whether the two are jointly significant. In this case, that test can be performed with Stata’s command -test- like this:

          Code:
          . webuse nlsy80, clear
          
          . anova  wage c.age  i.educ i.meduc i.feduc
          
          . test i.meduc i.feduc
           
                            Source | Partial SS         df         MS        F    Prob>F
                       ------------+----------------------------------------------------
                       meduc feduc |  7590003.4         35   216857.24      1.52  0.0285
                          Residual |   96190434        676   142293.54
          Thus, although neither of the categorical variables is statistically significant in the full equation, the two are jointly so, with a p-value of 0.0285.
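
The arithmetic behind that joint F-statistic can be reproduced directly from the numbers in the -test- output above (a plain-Python check, not Stata): the F-statistic is the mean square of the tested terms divided by the residual mean square.

```python
# Reproduce the joint F-statistic from the sums of squares and degrees of
# freedom reported in the -test- output above.
ss_joint, df_joint = 7590003.4, 35   # meduc feduc partial SS and df
ss_resid, df_resid = 96190434, 676   # residual SS and df

f_stat = (ss_joint / df_joint) / (ss_resid / df_resid)
print(round(f_stat, 2))  # matches the F = 1.52 reported by Stata
```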

          All this to say that, in order to diagnose multicollinearity between a categorical variable with more than two values and other categorical or continuous variables, it would be useful to have a generalized version of the VIF.
