
  • Dealing with Highly Collinear Independent Variables

    Dear Stata Members

    I have panel data in which my independent variables (Index1 to Index4) are highly collinear. In that case, rather than dropping one or more of the collinear variables, is it legitimate to transform the variables so that we can retain them? I will demonstrate my data and results with an example.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(Index1 Index2 Index3 Index4) long id int year float dep_var
    5.46687       .       .       . 1 1999        0
     3.5714 53.3333 49.2386 37.4359 1 2000 .0469986
    3.77717       .       .       . 1 2001        0
    3.97991  55.102 35.3535 34.1837 1 2002        0
    4.09675 56.6326 44.9495 43.3674 1 2003        0
    3.94243  55.665 34.6342  44.335 1 2004        0
    3.94921 51.4706 33.1707      50 1 2005        0
    4.05847 57.0732 37.0732 48.0392 1 2006        0
    3.92085 59.2233 33.4951 50.9709 1 2007        0
    4.64972 58.7379 36.4078 50.9709 1 2008        0
     4.8054 57.8947 36.8421  45.933 1 2009        0
    4.70902 58.3732 33.3333 44.4976 1 2010        0
    4.83402 58.7678 37.9147 44.5498 1 2011        0
    4.82298 57.8199 40.2844 44.0758 1 2012        0
    4.66564 54.9763 44.5498 44.0758 1 2013        0
    4.55899 65.3846 45.6731   43.75 1 2014        0
    4.52303 68.2692 48.5577 44.2308 1 2015        0
    4.86224 66.8269 49.0385 44.2308 1 2016        0
    5.33097 68.2692 46.1539 48.5577 1 2017        0
    5.62695 69.2308 46.1539 47.1154 1 2018        0
    5.89539 71.6346 45.1923 42.7885 1 2019        0
     3.5714 53.3333 49.2386 37.4359 2 2000        .
    3.77717       .       .       . 2 2001        0
    3.97991  55.102 35.3535 34.1837 2 2002        0
    4.09675 56.6326 44.9495 43.3674 2 2003        0
    3.94243  55.665 34.6342  44.335 2 2004        .
    3.94921 51.4706 33.1707      50 2 2005        .
    4.05847 57.0732 37.0732 48.0392 2 2006 .5771455
    3.92085 59.2233 33.4951 50.9709 2 2007        .
    4.64972 58.7379 36.4078 50.9709 2 2008        .
     4.8054 57.8947 36.8421  45.933 2 2009        0
    4.70902 58.3732 33.3333 44.4976 2 2010        0
    4.83402 58.7678 37.9147 44.5498 2 2011        0
    4.82298 57.8199 40.2844 44.0758 2 2012        0
    4.66564 54.9763 44.5498 44.0758 2 2013        0
    4.55899 65.3846 45.6731   43.75 2 2014        0
    4.52303 68.2692 48.5577 44.2308 2 2015        0
    4.86224 66.8269 49.0385 44.2308 2 2016        0
    5.33097 68.2692 46.1539 48.5577 2 2017        0
    5.62695 69.2308 46.1539 47.1154 2 2018        .
    5.89539 71.6346 45.1923 42.7885 2 2019        0
    5.46687       .       .       . 3 1999        0
     3.5714 53.3333 49.2386 37.4359 3 2000        .
    3.77717       .       .       . 3 2001        .
    3.97991  55.102 35.3535 34.1837 3 2002        .
    4.09675 56.6326 44.9495 43.3674 3 2003        .
    3.94243  55.665 34.6342  44.335 3 2004        .
    3.94921 51.4706 33.1707      50 3 2005        .
    4.05847 57.0732 37.0732 48.0392 3 2006        .
    3.92085 59.2233 33.4951 50.9709 3 2007        0
    4.64972 58.7379 36.4078 50.9709 3 2008        0
     4.8054 57.8947 36.8421  45.933 3 2009        .
    4.70902 58.3732 33.3333 44.4976 3 2010        0
    4.83402 58.7678 37.9147 44.5498 3 2011        0
    4.82298 57.8199 40.2844 44.0758 3 2012        0
    4.66564 54.9763 44.5498 44.0758 3 2013        .
    4.55899 65.3846 45.6731   43.75 3 2014        0
    4.52303 68.2692 48.5577 44.2308 3 2015        .
    4.86224 66.8269 49.0385 44.2308 3 2016        0
    5.33097 68.2692 46.1539 48.5577 3 2017        0
    5.62695 69.2308 46.1539 47.1154 3 2018        0
    5.89539 71.6346 45.1923 42.7885 3 2019        0
    5.46687       .       .       . 4 1999        0
     3.5714 53.3333 49.2386 37.4359 4 2000        0
    3.77717       .       .       . 4 2001        0
    3.97991  55.102 35.3535 34.1837 4 2002        0
    4.09675 56.6326 44.9495 43.3674 4 2003        .
    3.94243  55.665 34.6342  44.335 4 2004        0
    3.94921 51.4706 33.1707      50 4 2005        0
    4.05847 57.0732 37.0732 48.0392 4 2006        0
    3.92085 59.2233 33.4951 50.9709 4 2007        0
    4.64972 58.7379 36.4078 50.9709 4 2008        0
     4.8054 57.8947 36.8421  45.933 4 2009        0
    4.70902 58.3732 33.3333 44.4976 4 2010        0
    4.83402 58.7678 37.9147 44.5498 4 2011        0
    4.82298 57.8199 40.2844 44.0758 4 2012        0
    4.66564 54.9763 44.5498 44.0758 4 2013        0
    4.55899 65.3846 45.6731   43.75 4 2014        0
    4.52303 68.2692 48.5577 44.2308 4 2015        0
    4.86224 66.8269 49.0385 44.2308 4 2016        0
    5.33097 68.2692 46.1539 48.5577 4 2017        0
    5.62695 69.2308 46.1539 47.1154 4 2018        0
    5.89539 71.6346 45.1923 42.7885 4 2019        0
    5.46687       .       .       . 5 1999        .
     3.5714 53.3333 49.2386 37.4359 5 2000        .
    3.77717       .       .       . 5 2001        0
    3.97991  55.102 35.3535 34.1837 5 2002        0
    4.09675 56.6326 44.9495 43.3674 5 2003        0
    3.94243  55.665 34.6342  44.335 5 2004        .
    3.94921 51.4706 33.1707      50 5 2005        0
    4.05847 57.0732 37.0732 48.0392 5 2006        .
    3.92085 59.2233 33.4951 50.9709 5 2007        0
    4.64972 58.7379 36.4078 50.9709 5 2008        .
     4.8054 57.8947 36.8421  45.933 5 2009        0
    4.70902 58.3732 33.3333 44.4976 5 2010        0
    4.83402 58.7678 37.9147 44.5498 5 2011        0
    4.82298 57.8199 40.2844 44.0758 5 2012        0
    4.66564 54.9763 44.5498 44.0758 5 2013        0
    4.55899 65.3846 45.6731   43.75 5 2014        .
    4.52303 68.2692 48.5577 44.2308 5 2015        0
    end
    label values id id
    label def id 1 "000002.SZ", modify
    label def id 2 "000004.SZ", modify
    label def id 3 "000005.SZ", modify
    label def id 4 "000006.SZ", modify
    label def id 5 "000007.SZ", modify

    Code:
    pwcorr dep_var Index1 Index2 Index3 Index4 , sig star(.01)
    
                 |  dep_var   Index1   Index2   Index3   Index4
    -------------+---------------------------------------------
         dep_var |   1.0000 
                 |
                 |
          Index1 |  -0.1242   1.0000 
                 |   0.2819
                 |
          Index2 |  -0.0867   0.7584*  1.0000 
                 |   0.4757   0.0000
                 |
          Index3 |  -0.0658   0.3183*  0.5552*  1.0000 
                 |   0.5884   0.0021   0.0000
                 |
          Index4 |   0.0787   0.1896   0.1301  -0.3035*  1.0000 
                 |   0.5172   0.0719   0.2190   0.0034
                 |


    Code:
     reg dep_var Index1 Index2 Index3 i.id  i.year
    note: 2017.year omitted because of collinearity.
    note: 2018.year omitted because of collinearity.
    note: 2019.year omitted because of collinearity.
    
          Source |       SS           df       MS      Number of obs   =        70
    -------------+----------------------------------   F(22, 47)       =      1.28
           Model |  .123540378        22  .005615472   Prob > F        =    0.2345
        Residual |  .206200354        47  .004387242   R-squared       =    0.3747
    -------------+----------------------------------   Adj R-squared   =    0.0819
           Total |  .329740732        69  .004778851   Root MSE        =    .06624
    
    ------------------------------------------------------------------------------
         dep_var | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
          Index1 |   .0607583   .2623863     0.23   0.818    -.4670948    .5886114
          Index2 |  -.0082303   .0411951    -0.20   0.843    -.0911041    .0746435
          Index3 |   .0068583   .1109061     0.06   0.951     -.216256    .2299725
                 |
              id |
      000004.SZ  |   .0420866   .0246522     1.71   0.094    -.0075072    .0916804
      000005.SZ  |   .0090231   .0272016     0.33   0.742    -.0456995    .0637458
      000006.SZ  |  -.0035913   .0218879    -0.16   0.870     -.047624    .0404413
      000007.SZ  |   .0108503   .0272195     0.40   0.692    -.0439084    .0656089
                 |
            year |
           2002  |   .0473329    1.50762     0.03   0.975    -2.985606    3.080272
           2003  |  -.0182899   .4098085    -0.04   0.965    -.8427182    .8061384
           2004  |    .073309   1.573491     0.05   0.963    -3.092147    3.238765
           2005  |   .0441975   1.844819     0.02   0.981    -3.667099    3.755494
           2006  |   .2388757   1.268526     0.19   0.851     -2.31307    2.790822
           2007  |   .1058522   1.615725     0.07   0.948    -3.144567    3.356271
           2008  |   .0398561   1.309828     0.03   0.976    -2.595177    2.674889
           2009  |   .0099532   1.289834     0.01   0.994    -2.584859    2.604765
           2010  |   .0444741   1.660176     0.03   0.979    -3.295369    3.384318
           2011  |   .0087066    1.14843     0.01   0.994    -2.301636    2.319049
           2012  |  -.0146761   .9171183    -0.02   0.987     -1.85968    1.830328
           2013  |  -.0584361   .5495863    -0.11   0.916    -1.164061    1.047189
           2014  |   .0264604   .1801508     0.15   0.884    -.3359562     .388877
           2015  |   .0321464   .3668656     0.09   0.931     -.705892    .7701847
           2016  |  -.0031748   .3119827    -0.01   0.992     -.630803    .6244534
           2017  |          0  (omitted)
           2018  |          0  (omitted)
           2019  |          0  (omitted)
                 |
           _cons |  -.0904379   6.801528    -0.01   0.989    -13.77335    13.59247
    ------------------------------------------------------------------------------
    
    . estat vif
    
        Variable |       VIF       1/VIF  
    -------------+----------------------
          Index1 |    349.95    0.002858
          Index2 |    886.38    0.001128
          Index3 |   6173.44    0.000162
              id |
              2  |      1.47    0.681963
              3  |      1.45    0.691748
              4  |      1.46    0.684869
              5  |      1.45    0.690839
            year |
           2002  |   1953.88    0.000512
           2003  |    109.92    0.009098
           2004  |   1096.42    0.000912
           2005  |   2227.48    0.000449
           2006  |   1053.19    0.000949
           2007  |   2244.14    0.000446
           2008  |   1122.88    0.000891
           2009  |   1430.15    0.000699
           2010  |   2916.77    0.000343
           2011  |   1395.73    0.000716
           2012  |    890.11    0.001123
           2013  |    259.65    0.003851
           2014  |     27.90    0.035844
           2015  |    115.70    0.008643
           2016  |     83.67    0.011952
    -------------+----------------------
        Mean VIF |   1106.51
    .
    So my question is: rather than dropping variables, can we do something else to deal with multicollinearity?

  • #2
    Multicollinearity is a zombie. See Arthur Goldberger's A Course in Econometrics. There is a chapter on why worrying about multicollinearity is a waste of time and energy. It is usually not a problem at all, and when it is, there is nothing you can do about it anyway, unless you can markedly expand your data set. For a short supporting commentary by Bryan Caplan, see https://www.econlib.org/archives/200...ollineari.html.

    Also, in this case, you have much bigger problems. Look at that dependent variable. It is almost always zero, with just two exceptions. In your regression, almost all of the variance is being explained by the year and id that happen to be in those two observations. But for practical purposes, this dependent variable is really just a flat constant 0 with some very occasional noise. There is nothing to model here, even if all of your explanatory variables were completely orthogonal.



    • #3
      Dear Clyde Schechter
      Thanks for your reply; my own understanding of multicollinearity is based on your excellent writings in this forum:
      1. Near versus perfect multicollinearity: https://www.statalist.org/forums/forum/general-stata-discussion/general/1297526-multicollinearity-panel-data?p=1297657#post1297657
      2. Why VIFs are a waste: https://www.statalist.org/forums/for...74#post1465874
      However, at presentations people often ask about multicollinearity, VIFs, etc. Hence I am still doubtful about this.

      So in this case I will run models one by one, taking the collinear variables one at a time rather than all in one go. Is that fine?

      I am enthralled by how clearly you identified the issue with the dependent variable, which has many zeroes. There are some serious issues with it that I will raise in a new thread, since I should not reuse this topic for new, unrelated questions.
      Once again, thanks.



      • #4
        Neelakanda:
        as an aside to Clyde's helpful reply:
        1) why use -regress- without -vce(cluster panelid)- standard errors if you have panel data? And why use -regress- at all?
        2) all your predictors suffer from quasi-extreme multicollinearity. As Clyde pointed out, quoting chapter 23 of Goldberger's textbook, in general this is not an issue, but it becomes a problem when all of your independent variables suffer from it, as the regression machinery cannot disentangle the contribution of each predictor (adjusted for the other ones) to explaining variation in the regressand (which, in your example, is basically a constant).
        Kind regards,
        Carlo
        (StataNow 18.5)



        • #5
          Dear Carlo Lazzaro
          Thanks for the reply. I tried using xtset, but I don't know how to interpret
          Code:
          estat vce
          as estat vif is not available after xtreg. Moreover, I want to ask how multicollinearity can be a problem if I am using pooled OLS at all.

          Code:
           xtset id year
          
          Panel variable: id (unbalanced)
           Time variable: year, 1999 to 2019
                   Delta: 1 unit
          
          . xtreg dep_var Index1 Index2 Index3 Index4, fe vce(robust)
          
          Fixed-effects (within) regression               Number of obs     =         70
          Group variable: id                              Number of groups  =          5
          
          R-squared:                                      Obs per group:
               Within  = 0.0442                                         min =         10
               Between = 0.3651                                         avg =       14.0
               Overall = 0.0294                                         max =         19
          
                                                          F(4,4)            =       1.56
          corr(u_i, Xb) = -0.1297                         Prob > F          =     0.3396
          
                                               (Std. err. adjusted for 5 clusters in id)
          ------------------------------------------------------------------------------
                       |               Robust
               dep_var | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
          -------------+----------------------------------------------------------------
                Index1 |  -.0237862   .0248318    -0.96   0.392    -.0927303    .0451579
                Index2 |   .0003254   .0008764     0.37   0.729    -.0021079    .0027587
                Index3 |  -.0000999   .0008917    -0.11   0.916    -.0025757    .0023758
                Index4 |   .0022282   .0027642     0.81   0.465    -.0054463    .0099027
                 _cons |   .0033961   .0326341     0.10   0.922    -.0872108     .094003
          -------------+----------------------------------------------------------------
               sigma_u |  .02158748
               sigma_e |  .06964497
                   rho |  .08765628   (fraction of variance due to u_i)
          ------------------------------------------------------------------------------
          
          . estat vif
          estat vif not valid
          r(321);
          
          . estat vce
          
          Covariance matrix of coefficients of xtreg model
          
                  e(V) |     Index1      Index2      Index3      Index4       _cons 
          -------------+------------------------------------------------------------
                Index1 |  .00061662                                                 
                Index2 | -.00002068   7.681e-07                                     
                Index3 |  .00002026  -7.517e-07   7.951e-07                         
                Index4 |  -.0000681   2.363e-06  -2.339e-06   7.641e-06             
                 _cons |  .00062813  -.00002614   .00002428  -.00007531   .00106499
          But I am not quite sure how to interpret the results from estat vce. Also, given a model like mine, how would you choose variables if Index1 is my variable of interest? Which of the other indexes should I use in my model to reduce problems of multicollinearity?



          • #6
            Neelakanda:
            1) if you go with pooled OLS (a questionable first choice with panel data), observations belonging to the same panel are not independent; that's why you should use -vce(cluster panelid)- standard errors.
            The same holds for -xtreg- if you detect heteroskedasticity and/or autocorrelation of the epsilon, with the relevant difference that, unlike -regress-, both -robust- and -vce(cluster idcode)- do the very same job under -xtreg-, as both invoke cluster-robust standard errors. Conversely, under -regress-, the -robust- option takes only heteroskedasticity into account;
            2) while it's true that -estat vif- is not available after -xtreg-, for the same purpose you can type:
            Code:
            estat vce, corr
            3) as far as your -xtreg, fe- code is concerned, I would say that:
            a) given the very low within R-squared, your model is likely to suffer from misspecification;
            b) as sigma_e > sigma_u, the evidence of a panel-wise effect is unclear.
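            The points above can be sketched together as follows (a minimal sketch using the variable names from the example in #1; this is only an illustration, not a recommended final specification):
            Code:
            * pooled OLS with cluster-robust standard errors (clusters = panels)
            regress dep_var Index1 Index2 Index3 Index4 i.year, vce(cluster id)
            
            * under -xtreg-, -robust- and -vce(cluster id)- give the same cluster-robust SEs
            xtset id year
            xtreg dep_var Index1 Index2 Index3 Index4, fe vce(cluster id)
            
            * correlation matrix of the estimated coefficients, in lieu of -estat vif-
            estat vce, corr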
            Kind regards,
            Carlo
            (StataNow 18.5)



            • #7
              Dear Carlo Lazzaro
              Can I ask one more question in this regard? Is there a way to get stars with -estat vce, corr- so that significance at various levels can be ascertained?
              I do agree the model is misspecified, but I want to start with some basic models. Thanks for the observation regarding sigma_e > sigma_u; I never used to check them, and I am thankful for those diagnostics.



              • #8
                Neelakanda:
                not that I know.
                That said, a correlation > 0.75 between linear terms may be suspect (conversely, it is expected that the linear and squared terms of the same predictor are highly correlated).
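                As a quick, hypothetical illustration of that last point (Index1sq is a name introduced here only for the example):
                Code:
                * a linear term and its square are expectedly near-collinear
                gen Index1sq = Index1^2
                pwcorr Index1 Index1sq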
                Kind regards,
                Carlo
                (StataNow 18.5)



                • #9
                  "However, often at presentations, people ask about multicollinearity, VIFs etc. Hence I am still doubtful about this"
                  If I were asked about this at a presentation, I would respond as I have in your post and would cite Goldberger as a reference. (If you want you can find other references to this same kind of material on Google.)

                  "So in this case I will run models one by one by considering collinear variables one by one and not taking them all in one go. Is that fine?"
                  Probably not! I'm assuming that the Index* variables are being included as variables of interest, not just as nuisance variables whose effects must be adjusted for. So, presumably you believe that these variables are each associated with dep_var. And, as you have discovered, they are also associated with each other. Assuming that the causal direction of these relationships (if there is a causal relation) is in the direction from Index* to dep_var, then the Index* variables are each confounders of the relationships of the others with dep_var. To use only one is to doom your analysis to omitted variable bias.

                  The only genuine solution to the loss of precision associated with multicollinearity is to get a (usually much) larger data set. As Goldberger says, multicollinearity should properly be called micronumerosity.

                  Short of that, if there is some substantively meaningful way to combine the four Index* variables into a single variable that does not discard too much of the variation, then using that single variable as a proxy for the four Index* variables might produce a useful result. You often see, for example, people doing principal components analysis on the multicollinear variables and entering the components. Since principal components are orthogonal, this breaks the multicollinearity and produces precise results for the components. The problem is that the components are often meaningless in real-world terms: they don't correspond to anything in the real world, so you get a precise estimate of the effect of something that doesn't exist in reality! But in circumstances where, say, the first principal component is actually interpretable as a measure of something in the real world, this approach may be satisfactory.
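                  A minimal sketch of the principal-components approach described above, using the variable names from #1 (whether the first component is interpretable in real-world terms is a substantive judgment, not a statistical one):
                  Code:
                  * extract principal components of the four collinear indexes
                  pca Index1 Index2 Index3 Index4
                  * keep the first component's score as a single summary variable
                  predict pc1, score
                  * enter the orthogonal summary in place of the four indexes
                  regress dep_var pc1 i.id i.year, vce(cluster id)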



                  • #10
                    Dear Clyde Schechter
                    Thanks for the excellent explanation. As I recall from one of your posts, multicollinearity can:
                    1) reverse the signs of collinear variables;
                    2) leave none of the collinear variables individually significant as predictors even though R^2 is high.
                    My main variable of interest is Index1, which is a proxy for political uncertainty (the higher the value, the higher the uncertainty). However, it can happen that Index1 is a significant predictor not because it measures political uncertainty but because of omitted variable bias arising from policy uncertainty (Index2). Thus, to control for the impact of policy uncertainty, I can use Index2, but as it is highly collinear with Index1, I fear one of the two problems above can occur. In the first case, I may find a different association between dep_var and Index1 owing to Index2, for which I don't have a theory; if the second happens, my study is gone. Getting more data than I presently have is difficult, as these are secondary data. So I thought I could either ignore Index2 and accept omitted variable bias, or include both and report the results accordingly. Which is more sinful: omitted variable bias or bias due to multicollinearity? I am not quite sure.
                    Thanks once again for the insightful description of multicollinearity
                    Last edited by Neelakanda Krishna; 27 Feb 2022, 22:17.



                    • #11
                      Multicollinearity does not cause bias. Yes, it can reverse the sign of a coefficient, but only because the standard error becomes so large that the values with opposite signs are, in effect, indistinguishable.

                      Nevertheless, when you get a result that is in the "wrong" direction, whether due to bias or due to imprecision, that is a problem. But, as Goldberger would say, your problem is best described not as multicollinearity but as micronumerosity.

                      I hate to say this, but a data set of 70 observations with two important highly-correlated variables is just not suitable for your problem. I understand that getting more data is difficult, maybe even impossible. But I am not the first person to note that having a data set and a question plus a burning desire is not sufficient to assure that the question can be adequately answered. The options I can see are:

                      1. Proceed with the data you have and the analysis involving Index1 and Index2, and accept the possibility that your study may be inconclusive to the extent that a joint political and policy uncertainty effect may be identifiable, but it may be impossible to distinguish their separate effects.

                      2. Get a much larger data set (I realize this is likely not feasible).

                      3. Get a different data set that samples entities in such a way that Index1 and Index2 are no longer highly correlated. For example, you might draw your sample in such a way that for every entity where Index1 and Index2 are both high (or both low) you include another where one is high and the other is low. This would be a non-random sample and it would deliberately mis-represent the correlation between Index1 and Index2 that would be found in the full population of entities, but it would be better able to discriminate policy uncertainty from political uncertainty. This is a type of stratified sampling, and the mis-representativeness of the sample can be compensated for by a suitable weighting scheme. (This approach is sometimes used in epidemiology: tobacco and alcohol addiction tend to go together. In studies where it is important to be able to disentangle their effects, the subject accrual process will sometimes require that for every person who exhibits both, or neither, of these traits, we also include a person who is discordant for them, and balance the latter with the discordance going in both directions. In the resulting sample, tobacco and alcohol addiction are [nearly] independent, and their effects can be separately estimated.)

                      4. Repurpose your data to study a different question that does not require unbiased and precise estimation of both the effects of Index1 and Index2. That is, study some other variables, and either forget about Index1 and Index2, or include them only as necessary covariates to avoid omitted variable bias on some other association.



                      • #12
                        Dear Clyde Schechter
                        I don't have words to express my gratitude for such an excellent discourse on multicollinearity, its consequences, and some ways to deal with it (the example from epidemiology is an eye-opener).
                        I wish you a good day, and I'm so thankful for everything you bring to this forum.

