
  • Dealing with Highly Collinear Independent Variables

    Dear Stata Members

    I have panel data in which my independent variables (Index1 to Index4) are highly collinear. In that case, rather than dropping one or more of the collinear variables, is it legitimate to transform the variables so that we can retain them? I will demonstrate my data and results with an example.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(Index1 Index2 Index3 Index4) long id int year float dep_var
    5.46687       .       .       . 1 1999        0
     3.5714 53.3333 49.2386 37.4359 1 2000 .0469986
    3.77717       .       .       . 1 2001        0
    3.97991  55.102 35.3535 34.1837 1 2002        0
    4.09675 56.6326 44.9495 43.3674 1 2003        0
    3.94243  55.665 34.6342  44.335 1 2004        0
    3.94921 51.4706 33.1707      50 1 2005        0
    4.05847 57.0732 37.0732 48.0392 1 2006        0
    3.92085 59.2233 33.4951 50.9709 1 2007        0
    4.64972 58.7379 36.4078 50.9709 1 2008        0
     4.8054 57.8947 36.8421  45.933 1 2009        0
    4.70902 58.3732 33.3333 44.4976 1 2010        0
    4.83402 58.7678 37.9147 44.5498 1 2011        0
    4.82298 57.8199 40.2844 44.0758 1 2012        0
    4.66564 54.9763 44.5498 44.0758 1 2013        0
    4.55899 65.3846 45.6731   43.75 1 2014        0
    4.52303 68.2692 48.5577 44.2308 1 2015        0
    4.86224 66.8269 49.0385 44.2308 1 2016        0
    5.33097 68.2692 46.1539 48.5577 1 2017        0
    5.62695 69.2308 46.1539 47.1154 1 2018        0
    5.89539 71.6346 45.1923 42.7885 1 2019        0
     3.5714 53.3333 49.2386 37.4359 2 2000        .
    3.77717       .       .       . 2 2001        0
    3.97991  55.102 35.3535 34.1837 2 2002        0
    4.09675 56.6326 44.9495 43.3674 2 2003        0
    3.94243  55.665 34.6342  44.335 2 2004        .
    3.94921 51.4706 33.1707      50 2 2005        .
    4.05847 57.0732 37.0732 48.0392 2 2006 .5771455
    3.92085 59.2233 33.4951 50.9709 2 2007        .
    4.64972 58.7379 36.4078 50.9709 2 2008        .
     4.8054 57.8947 36.8421  45.933 2 2009        0
    4.70902 58.3732 33.3333 44.4976 2 2010        0
    4.83402 58.7678 37.9147 44.5498 2 2011        0
    4.82298 57.8199 40.2844 44.0758 2 2012        0
    4.66564 54.9763 44.5498 44.0758 2 2013        0
    4.55899 65.3846 45.6731   43.75 2 2014        0
    4.52303 68.2692 48.5577 44.2308 2 2015        0
    4.86224 66.8269 49.0385 44.2308 2 2016        0
    5.33097 68.2692 46.1539 48.5577 2 2017        0
    5.62695 69.2308 46.1539 47.1154 2 2018        .
    5.89539 71.6346 45.1923 42.7885 2 2019        0
    5.46687       .       .       . 3 1999        0
     3.5714 53.3333 49.2386 37.4359 3 2000        .
    3.77717       .       .       . 3 2001        .
    3.97991  55.102 35.3535 34.1837 3 2002        .
    4.09675 56.6326 44.9495 43.3674 3 2003        .
    3.94243  55.665 34.6342  44.335 3 2004        .
    3.94921 51.4706 33.1707      50 3 2005        .
    4.05847 57.0732 37.0732 48.0392 3 2006        .
    3.92085 59.2233 33.4951 50.9709 3 2007        0
    4.64972 58.7379 36.4078 50.9709 3 2008        0
     4.8054 57.8947 36.8421  45.933 3 2009        .
    4.70902 58.3732 33.3333 44.4976 3 2010        0
    4.83402 58.7678 37.9147 44.5498 3 2011        0
    4.82298 57.8199 40.2844 44.0758 3 2012        0
    4.66564 54.9763 44.5498 44.0758 3 2013        .
    4.55899 65.3846 45.6731   43.75 3 2014        0
    4.52303 68.2692 48.5577 44.2308 3 2015        .
    4.86224 66.8269 49.0385 44.2308 3 2016        0
    5.33097 68.2692 46.1539 48.5577 3 2017        0
    5.62695 69.2308 46.1539 47.1154 3 2018        0
    5.89539 71.6346 45.1923 42.7885 3 2019        0
    5.46687       .       .       . 4 1999        0
     3.5714 53.3333 49.2386 37.4359 4 2000        0
    3.77717       .       .       . 4 2001        0
    3.97991  55.102 35.3535 34.1837 4 2002        0
    4.09675 56.6326 44.9495 43.3674 4 2003        .
    3.94243  55.665 34.6342  44.335 4 2004        0
    3.94921 51.4706 33.1707      50 4 2005        0
    4.05847 57.0732 37.0732 48.0392 4 2006        0
    3.92085 59.2233 33.4951 50.9709 4 2007        0
    4.64972 58.7379 36.4078 50.9709 4 2008        0
     4.8054 57.8947 36.8421  45.933 4 2009        0
    4.70902 58.3732 33.3333 44.4976 4 2010        0
    4.83402 58.7678 37.9147 44.5498 4 2011        0
    4.82298 57.8199 40.2844 44.0758 4 2012        0
    4.66564 54.9763 44.5498 44.0758 4 2013        0
    4.55899 65.3846 45.6731   43.75 4 2014        0
    4.52303 68.2692 48.5577 44.2308 4 2015        0
    4.86224 66.8269 49.0385 44.2308 4 2016        0
    5.33097 68.2692 46.1539 48.5577 4 2017        0
    5.62695 69.2308 46.1539 47.1154 4 2018        0
    5.89539 71.6346 45.1923 42.7885 4 2019        0
    5.46687       .       .       . 5 1999        .
     3.5714 53.3333 49.2386 37.4359 5 2000        .
    3.77717       .       .       . 5 2001        0
    3.97991  55.102 35.3535 34.1837 5 2002        0
    4.09675 56.6326 44.9495 43.3674 5 2003        0
    3.94243  55.665 34.6342  44.335 5 2004        .
    3.94921 51.4706 33.1707      50 5 2005        0
    4.05847 57.0732 37.0732 48.0392 5 2006        .
    3.92085 59.2233 33.4951 50.9709 5 2007        0
    4.64972 58.7379 36.4078 50.9709 5 2008        .
     4.8054 57.8947 36.8421  45.933 5 2009        0
    4.70902 58.3732 33.3333 44.4976 5 2010        0
    4.83402 58.7678 37.9147 44.5498 5 2011        0
    4.82298 57.8199 40.2844 44.0758 5 2012        0
    4.66564 54.9763 44.5498 44.0758 5 2013        0
    4.55899 65.3846 45.6731   43.75 5 2014        .
    4.52303 68.2692 48.5577 44.2308 5 2015        0
    end
    label values id id
    label def id 1 "000002.SZ", modify
    label def id 2 "000004.SZ", modify
    label def id 3 "000005.SZ", modify
    label def id 4 "000006.SZ", modify
    label def id 5 "000007.SZ", modify

    Code:
    pwcorr dep_var Index1 Index2 Index3 Index4 , sig star(.01)
    
                 |  dep_var   Index1   Index2   Index3   Index4
    -------------+---------------------------------------------
         dep_var |   1.0000 
                 |
                 |
          Index1 |  -0.1242   1.0000 
                 |   0.2819
                 |
          Index2 |  -0.0867   0.7584*  1.0000 
                 |   0.4757   0.0000
                 |
          Index3 |  -0.0658   0.3183*  0.5552*  1.0000 
                 |   0.5884   0.0021   0.0000
                 |
          Index4 |   0.0787   0.1896   0.1301  -0.3035*  1.0000 
                 |   0.5172   0.0719   0.2190   0.0034
                 |


    Code:
     reg dep_var Index1 Index2 Index3 i.id  i.year
    note: 2017.year omitted because of collinearity.
    note: 2018.year omitted because of collinearity.
    note: 2019.year omitted because of collinearity.
    
          Source |       SS           df       MS      Number of obs   =        70
    -------------+----------------------------------   F(22, 47)       =      1.28
           Model |  .123540378        22  .005615472   Prob > F        =    0.2345
        Residual |  .206200354        47  .004387242   R-squared       =    0.3747
    -------------+----------------------------------   Adj R-squared   =    0.0819
           Total |  .329740732        69  .004778851   Root MSE        =    .06624
    
    ------------------------------------------------------------------------------
         dep_var | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
          Index1 |   .0607583   .2623863     0.23   0.818    -.4670948    .5886114
          Index2 |  -.0082303   .0411951    -0.20   0.843    -.0911041    .0746435
          Index3 |   .0068583   .1109061     0.06   0.951     -.216256    .2299725
                 |
              id |
      000004.SZ  |   .0420866   .0246522     1.71   0.094    -.0075072    .0916804
      000005.SZ  |   .0090231   .0272016     0.33   0.742    -.0456995    .0637458
      000006.SZ  |  -.0035913   .0218879    -0.16   0.870     -.047624    .0404413
      000007.SZ  |   .0108503   .0272195     0.40   0.692    -.0439084    .0656089
                 |
            year |
           2002  |   .0473329    1.50762     0.03   0.975    -2.985606    3.080272
           2003  |  -.0182899   .4098085    -0.04   0.965    -.8427182    .8061384
           2004  |    .073309   1.573491     0.05   0.963    -3.092147    3.238765
           2005  |   .0441975   1.844819     0.02   0.981    -3.667099    3.755494
           2006  |   .2388757   1.268526     0.19   0.851     -2.31307    2.790822
           2007  |   .1058522   1.615725     0.07   0.948    -3.144567    3.356271
           2008  |   .0398561   1.309828     0.03   0.976    -2.595177    2.674889
           2009  |   .0099532   1.289834     0.01   0.994    -2.584859    2.604765
           2010  |   .0444741   1.660176     0.03   0.979    -3.295369    3.384318
           2011  |   .0087066    1.14843     0.01   0.994    -2.301636    2.319049
           2012  |  -.0146761   .9171183    -0.02   0.987     -1.85968    1.830328
           2013  |  -.0584361   .5495863    -0.11   0.916    -1.164061    1.047189
           2014  |   .0264604   .1801508     0.15   0.884    -.3359562     .388877
           2015  |   .0321464   .3668656     0.09   0.931     -.705892    .7701847
           2016  |  -.0031748   .3119827    -0.01   0.992     -.630803    .6244534
           2017  |          0  (omitted)
           2018  |          0  (omitted)
           2019  |          0  (omitted)
                 |
           _cons |  -.0904379   6.801528    -0.01   0.989    -13.77335    13.59247
    ------------------------------------------------------------------------------
    
    . estat vif
    
        Variable |       VIF       1/VIF  
    -------------+----------------------
          Index1 |    349.95    0.002858
          Index2 |    886.38    0.001128
          Index3 |   6173.44    0.000162
              id |
              2  |      1.47    0.681963
              3  |      1.45    0.691748
              4  |      1.46    0.684869
              5  |      1.45    0.690839
            year |
           2002  |   1953.88    0.000512
           2003  |    109.92    0.009098
           2004  |   1096.42    0.000912
           2005  |   2227.48    0.000449
           2006  |   1053.19    0.000949
           2007  |   2244.14    0.000446
           2008  |   1122.88    0.000891
           2009  |   1430.15    0.000699
           2010  |   2916.77    0.000343
           2011  |   1395.73    0.000716
           2012  |    890.11    0.001123
           2013  |    259.65    0.003851
           2014  |     27.90    0.035844
           2015  |    115.70    0.008643
           2016  |     83.67    0.011952
    -------------+----------------------
        Mean VIF |   1106.51
    .
    So my question is: rather than dropping variables, can we do something else to deal with multicollinearity?

  • #2
    Multicollinearity is a zombie. See Arthur Goldberger's A Course in Econometrics. There is a chapter on why worrying about multicollinearity is a waste of time and energy. It is usually not a problem at all, and when it is, there is nothing you can do about it anyway, unless you can markedly expand your data set. For a short supporting commentary by Bryan Caplan, see https://www.econlib.org/archives/200...ollineari.html.

    Also, in this case, you have much bigger problems. Look at that dependent variable. It is almost always zero, with just two exceptions. In your regression, almost all of the variance is being explained by the year and id that happen to be in those two observations. But for practical purposes, this dependent variable is really just a flat constant 0 with some very occasional noise. There is nothing to model here, even if all of your explanatory variables were completely orthogonal.



    • #3
      Dear Clyde Schechter
      Thanks for your reply; my own understanding of multicollinearity is based on your excellent writings in this forum:
      1. Near versus perfect multicollinearity: https://www.statalist.org/forums/forum/general-stata-discussion/general/1297526-multicollinearity-panel-data?p=1297657#post1297657
      2. Why VIFs are a waste: https://www.statalist.org/forums/for...74#post1465874
      However, at presentations people often ask about multicollinearity, VIFs, etc. Hence I am still doubtful about this.

      So in this case I will run models one by one, taking the collinear variables one at a time rather than all in one go. Is that fine?

      I am enthralled by how clearly you identified the issue with the dependent variable, which has many zeroes. There are some serious issues with it that I will raise in a new thread, since I should not reuse this topic for new, unrelated questions.
      Once again, thanks.



      • #4
        Neelakanda:
        as an aside to Clyde's helpful reply:
        1) why use -regress- without -vce(cluster panelid)- standard errors if you have panel data? And why use -regress- at all?
        2) all your predictors suffer from quasi-extreme multicollinearity. As Clyde pointed out, quoting chapter 23 of Goldberger's textbook, in general this is not an issue, but it becomes a problem when all of your independent variables suffer from it, as the regression machinery cannot disentangle the contribution of each predictor (adjusted for the other ones) to explaining variation in the regressand (which, in your example, is basically a constant).
        Kind regards,
        Carlo
        (StataNow 18.5)



        • #5
          Dear Carlo Lazzaro
          Thanks for the reply. I tried using xtset, but I don't know how to interpret
          Code:
          estat vce
          as estat vif is not available after xtreg. Moreover, I want to ask how multicollinearity can be a problem if I am using pooled OLS at all.

          Code:
           xtset id year
          
          Panel variable: id (unbalanced)
           Time variable: year, 1999 to 2019
                   Delta: 1 unit
          
          . xtreg dep_var Index1 Index2 Index3 Index4, fe vce(robust)
          
          Fixed-effects (within) regression               Number of obs     =         70
          Group variable: id                              Number of groups  =          5
          
          R-squared:                                      Obs per group:
               Within  = 0.0442                                         min =         10
               Between = 0.3651                                         avg =       14.0
               Overall = 0.0294                                         max =         19
          
                                                          F(4,4)            =       1.56
          corr(u_i, Xb) = -0.1297                         Prob > F          =     0.3396
          
                                               (Std. err. adjusted for 5 clusters in id)
          ------------------------------------------------------------------------------
                       |               Robust
               dep_var | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
          -------------+----------------------------------------------------------------
                Index1 |  -.0237862   .0248318    -0.96   0.392    -.0927303    .0451579
                Index2 |   .0003254   .0008764     0.37   0.729    -.0021079    .0027587
                Index3 |  -.0000999   .0008917    -0.11   0.916    -.0025757    .0023758
                Index4 |   .0022282   .0027642     0.81   0.465    -.0054463    .0099027
                 _cons |   .0033961   .0326341     0.10   0.922    -.0872108     .094003
          -------------+----------------------------------------------------------------
               sigma_u |  .02158748
               sigma_e |  .06964497
                   rho |  .08765628   (fraction of variance due to u_i)
          ------------------------------------------------------------------------------
          
          . estat vif
          estat vif not valid
          r(321);
          
          . estat vce
          
          Covariance matrix of coefficients of xtreg model
          
                  e(V) |     Index1      Index2      Index3      Index4       _cons 
          -------------+------------------------------------------------------------
                Index1 |  .00061662                                                 
                Index2 | -.00002068   7.681e-07                                     
                Index3 |  .00002026  -7.517e-07   7.951e-07                         
                Index4 |  -.0000681   2.363e-06  -2.339e-06   7.641e-06             
                 _cons |  .00062813  -.00002614   .00002428  -.00007531   .00106499
          But I am not quite sure how to interpret the results from estat vce. Also, given a model like mine, how would you choose variables if Index1 is my variable of interest? Which of the other indexes should I use in my model to reduce problems of multicollinearity?



          • #6
            Neelakanda:
            1) if you go with pooled OLS (a questionable first choice with panel data), observations belonging to the same panel are not independent; that's why you should use -vce(cluster panelid)- standard errors.
            The same holds for -xtreg- if you detect heteroskedasticity and/or autocorrelation of the epsilon, with the relevant difference that, unlike -regress-, both -robust- and -vce(cluster idcode)- do the very same job under -xtreg-, as both invoke cluster-robust standard errors. Conversely, under -regress-, the -robust- option takes only heteroskedasticity into account;
            2) while it's true that -estat vif- is not available after -xtreg-, for the same purpose you can type:
            Code:
            estat vce, corr
            3) as far as your -xtreg, fe- code is concerned, I would say that:
            a) given the very low within R-squared, your model is likely to suffer from misspecification;
            b) as sigma_e > sigma_u, the evidence of a panel-wise effect is unclear.
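            The points above can be sketched together as follows (a minimal sketch using the variable names from the example in #1; this is only an illustration, not a recommended final specification):
            Code:
            * pooled OLS with cluster-robust standard errors (clusters = panels)
            regress dep_var Index1 Index2 Index3 Index4 i.year, vce(cluster id)
            
            * under -xtreg-, -robust- and -vce(cluster id)- give the same cluster-robust SEs
            xtset id year
            xtreg dep_var Index1 Index2 Index3 Index4, fe vce(cluster id)
            
            * correlation matrix of the estimated coefficients, in lieu of -estat vif-
            estat vce, corr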
            Kind regards,
            Carlo
            (StataNow 18.5)



            • #7
              Dear Carlo Lazzaro
              Can I ask one more question in this regard? Is there a way to get stars with -estat vce, corr- so that significance at various levels can be ascertained?
              I do agree the model is misspecified, but I want to start with some basic models. Thanks for the observation regarding sigma_e > sigma_u; I never used to check them, and I am thankful for those diagnostics.



              • #8
                Neelakanda:
                not that I know.
                That said, a correlation > 0.75 between linear terms may be suspect (conversely, it is expected that the linear and squared terms of the same predictor are highly correlated).
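                As a quick, hypothetical illustration of that last point (Index1sq is a name introduced here only for the example):
                Code:
                * a linear term and its square are expectedly near-collinear
                gen Index1sq = Index1^2
                pwcorr Index1 Index1sq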
                Kind regards,
                Carlo
                (StataNow 18.5)



                • #9
                  "However, often at presentations, people ask about multicollinearity, VIFs etc. Hence I am still doubtful about this"
                  If I were asked about this at a presentation, I would respond as I have in your post and would cite Goldberger as a reference. (If you want you can find other references to this same kind of material on Google.)

                  "So in this case I will run models one by one by considering collinear variables one by one and not taking them all in one go. Is that fine?"
                  Probably not! I'm assuming that the Index* variables are being included as variables of interest, not just as nuisance variables whose effects must be adjusted for. So, presumably you believe that these variables are each associated with dep_var. And, as you have discovered, they are also associated with each other. Assuming that the causal direction of these relationships (if there is a causal relation) is in the direction from Index* to dep_var, then the Index* variables are each confounders of the relationships of the others with dep_var. To use only one is to doom your analysis to omitted variable bias.

                  The only genuine solution to the loss of precision associated with multicollinearity is to get a (usually much) larger data set. As Goldberger says, multicollinearity should properly be called micronumerosity.

                  Short of that, if there is some substantively meaningful way to combine the four Index* variables into a single variable that does not discard too much of the variation, then using that single variable as a proxy for the four Index* variables might produce a useful result. You often see, for example, people doing principal components analysis on the multicollinear variables and entering the components. Since principal components are orthogonal, this breaks the multicollinearity and produces precise results for the components. The problem is that the components are often meaningless in real-world terms: they don't correspond to anything in the real world, so you get a precise estimate of the effect of something that doesn't exist in reality! But in circumstances where, say, the first principal component is actually interpretable as a measure of something in the real world, this approach may be satisfactory.
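                  A minimal sketch of the principal-components approach described above, using the variable names from #1 (whether the first component is interpretable in real-world terms is a substantive judgment, not a statistical one):
                  Code:
                  * extract principal components of the four collinear indexes
                  pca Index1 Index2 Index3 Index4
                  * keep the first component's score as a single summary variable
                  predict pc1, score
                  * enter the orthogonal summary in place of the four indexes
                  regress dep_var pc1 i.id i.year, vce(cluster id)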



                  • #10
                    Dear Clyde Schechter
                    Thanks for the excellent explanation. As I recall from one of your posts, multicollinearity can:
                    1) reverse the signs of collinear variables;
                    2) leave none of the collinear variables individually significant as predictors even though R^2 is high.
                    My main variable of interest is Index1, which is a proxy for political uncertainty (the higher the value, the higher the uncertainty). However, it can happen that Index1 is a significant predictor not because it measures political uncertainty but because of omitted variable bias arising from policy uncertainty (Index2). Thus, to control for the impact of policy uncertainty, I can use Index2, but as it is highly collinear with Index1, I fear one of the two problems above can occur. In the first case, I may find a different association between dep_var and Index1 owing to Index2, for which I don't have a theory; if the second happens, my study is gone. Getting more data than I presently have is difficult, as these are secondary data. So I thought I could either ignore Index2 and accept omitted variable bias, or include both and report the results accordingly. Which is more sinful: omitted variable bias or bias due to multicollinearity? I am not quite sure.
                    Thanks once again for the insightful description of multicollinearity
                    Last edited by Neelakanda Krishna; 27 Feb 2022, 22:17.



                    • #11
                      Multicollinearity does not cause bias. Yes, it can reverse the sign of a coefficient, but only because the standard error becomes so large that the values with opposite signs are, in effect, indistinguishable.

                      Nevertheless, when you get a result that is in the "wrong" direction, whether due to bias or due to imprecision, that is a problem. But, as Goldberger would say, your problem is best described not as multicollinearity but as micronumerosity.

                      I hate to say this, but a data set of 70 observations with two important highly-correlated variables is just not suitable for your problem. I understand that getting more data is difficult, maybe even impossible. But I am not the first person to note that having a data set and a question plus a burning desire is not sufficient to assure that the question can be adequately answered. The options I can see are:

                      1. Proceed with the data you have and the analysis involving Index1 and Index2, and accept the possibility that your study may be inconclusive to the extent that a joint political and policy uncertainty effect may be identifiable, but it may be impossible to distinguish their separate effects.

                      2. Get a much larger data set (I realize this is likely not feasible).

                      3. Get a different data set that samples entities in such a way that Index1 and Index2 are no longer highly correlated. For example, you might draw your sample in such a way that for every entity where Index1 and Index2 are both high (or both low) you include another where one is high and the other is low. This would be a non-random sample and it would deliberately mis-represent the correlation between Index1 and Index2 that would be found in the full population of entities, but it would be better able to discriminate policy uncertainty from political uncertainty. This is a type of stratified sampling, and the mis-representativeness of the sample can be compensated for by a suitable weighting scheme. (This approach is sometimes used in epidemiology: tobacco and alcohol addiction tend to go together. In studies where it is important to be able to disentangle their effects, the subject accrual process will sometimes require that for every person who exhibits both, or neither, of these traits, we also include a person who is discordant for them, and balance the latter with the discordance going in both directions. In the resulting sample, tobacco and alcohol addiction are [nearly] independent, and their effects can be separately estimated.)

                      4. Repurpose your data to study a different question that does not require unbiased and precise estimation of both the effects of Index1 and Index2. That is, study some other variables, and either forget about Index1 and Index2, or include them only as necessary covariates to avoid omitted variable bias on some other association.



                      • #12
                        Dear Clyde Schechter
                        I don't have words to express my gratitude for such an excellent discourse on multicollinearity, its consequences, and some ways to deal with it (the example from epidemiology is an eye-opener).
                        I wish you a good day, and I'm so thankful for everything you bring to this forum.

