Multicollinearity in Lagged Independent Variables

Clarine Caine

Join Date: Apr 2021
Posts: 6

28 Apr 2021, 11:18

Hi, I'm doing research for my thesis on the impact of environmental innovation on financial performance. Since it can take years for innovation to affect a company's financial performance, I'm using lags of the environmental innovation variable from t-1 through t-5. I'm using panel data.

The problem is that there is multicollinearity among my lagged variables. I'm wondering whether it is all right to ignore the multicollinearity. Are there any approaches that should be taken?

Here are the results of the multicollinearity check:

Code:

. reg ROA_w ENV_w ENV_lag1 ENV_lag2 ENV_lag3 ENV_lag4 ENV_lag5 RDI_w SIZE_w

      Source |       SS           df       MS      Number of obs   =       825
-------------+----------------------------------   F(8, 816)       =     10.18
       Model |   657.26792         8    82.15849   Prob > F        =    0.0000
    Residual |  6583.71871       816  8.06828273   R-squared       =    0.0908
-------------+----------------------------------   Adj R-squared   =    0.0819
       Total |  7240.98662       824  8.78760513   Root MSE        =    2.8405

------------------------------------------------------------------------------
       ROA_w |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       ENV_w |  -.0074454   .0105458    -0.71   0.480    -.0281456    .0132547
    ENV_lag1 |    .008536   .0137712     0.62   0.536    -.0184951    .0355672
    ENV_lag2 |  -.0115661   .0119388    -0.97   0.333    -.0350004    .0118682
    ENV_lag3 |   .0045303    .010918     0.41   0.678    -.0169003    .0259609
    ENV_lag4 |   .0092341   .0112023     0.82   0.410    -.0127546    .0312228
    ENV_lag5 |  -.0232335   .0085287    -2.72   0.007    -.0399743   -.0064927
       RDI_w |   24.61587   3.864417     6.37   0.000      17.0305    32.20124
      SIZE_w |  -9.44e-06   2.53e-06    -3.72   0.000    -.0000144   -4.46e-06
       _cons |   5.626047   .2895753    19.43   0.000     5.057647    6.194447
------------------------------------------------------------------------------

            

. vif

    Variable |       VIF       1/VIF  
-------------+----------------------
    ENV_lag1 |     12.09    0.082687
    ENV_lag2 |      9.55    0.104716
    ENV_lag4 |      9.55    0.104716
    ENV_lag3 |      8.54    0.117050
       ENV_w |      6.95    0.143930
    ENV_lag5 |      5.75    0.173855
      SIZE_w |      1.04    0.964869
       RDI_w |      1.02    0.982548
-------------+----------------------
    Mean VIF |      6.81

Thank you!

Last edited by Clarine Caine; 28 Apr 2021, 11:23.

Rhys Williams

Join Date: Apr 2020

Posts: 224
#2

28 Apr 2021, 13:57

Hi Clarine,

You say you have panel data, but you don't seem to be making the best use of it, i.e., it looks like you are pooling all the data together. Is that the best approach? Perhaps you want to consider a panel approach such as fixed or random effects.
At the very least, I would recommend using robust standard errors in your current specification (and clustered SEs if you use a panel approach).

In terms of multicollinearity, the effect is to increase the standard errors of your estimates. It won't bias the coefficient estimates. You can get around this by introducing more data, but that probably isn't an option. This could explain the high standard errors on your first four lags, although your fifth lag is significant. To be honest, I think it's to be expected that multicollinearity is high, given that you're including lags.
You could try removing some of the earlier lags, but I don't know exactly what hypothesis you are testing or what the literature does in this area.
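
To make the link between the VIF and the standard errors concrete: the VIF for a regressor is 1/(1 - R²), where R² comes from regressing that variable on all the other regressors. A minimal sketch, assuming the variable names from the first post and that the auxiliary regression is run on the same observations as the original regression:

```stata
* Auxiliary regression of one lag on all the other regressors.
* The if-condition restricts to observations where ROA_w is nonmissing,
* to mimic the sample of the original regression.
quietly regress ENV_lag1 ENV_w ENV_lag2 ENV_lag3 ENV_lag4 ENV_lag5 ///
    RDI_w SIZE_w if !missing(ROA_w)

* VIF = 1/(1 - R-squared) from the auxiliary regression
display "R-squared for ENV_lag1 on the other regressors: " e(r2)
display "Implied VIF for ENV_lag1: " 1/(1 - e(r2))
```

On the same sample, the implied VIF should agree with what vif reports for ENV_lag1. A high value only says that the other regressors explain most of the variation in that lag, which is what one would expect for adjacent lags of a persistent variable.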

Best Rhys
Oscar Ozfidan

Join Date: Sep 2018

Posts: 257
#3

28 Apr 2021, 20:07

Please type
help xtset
to see how to declare your data as panel data. It would be something like this:

xtset panelvar timevar

Then you can use time-series operators instead of creating the lagged variables yourself. See
help tsvarlist

After doing that, your estimation command would look like
xtreg ROA_w L(0/5).ENV_w RDI_w SIZE_w

If collinearity is still a problem after all that, then you must either find more data (not likely to be an option) or eliminate some of the lags (the most likely solution).
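
Put together, the steps above might look like the sketch below. ID is the panel identifier that appears in the xtreg output later in the thread; YEAR is a hypothetical name for the time variable, which the thread never shows. The fe and vce(cluster ID) options are also an assumption, added to match the fixed-effects specification discussed in the thread.

```stata
* Declare the panel structure (YEAR is a hypothetical time variable name)
xtset ID YEAR

* L(0/5).ENV_w expands to ENV_w and its first through fifth lags,
* so no hand-made ENV_lag* variables are needed
xtreg ROA_w L(0/5).ENV_w RDI_w SIZE_w, fe vce(cluster ID)
```

One advantage of the operators is that lags are computed within each panel and respect gaps in the time variable, which hand-made lags can easily get wrong.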

Last edited by Oscar Ozfidan; 28 Apr 2021, 20:09.

Clarine Caine

Join Date: Apr 2021
Posts: 6

28 Apr 2021, 21:48

Hi Rhys,

I already ran the Chow, Hausman, and Breusch-Pagan tests. The results indicate that I should use the fixed-effects model. I only used the regression in my first post to get the VIFs.

Here are the results of my regression:

Code:

. xtreg ROA_w ENV_w ENV_lag1 ENV_lag2 ENV_lag3 ENV_lag4 ENV_lag5 RDI_w SIZE_w, fe cluster(ID)

Fixed-effects (within) regression               Number of obs     =        825
Group variable: ID                              Number of groups  =        275

R-sq:                                           Obs per group:
     within  = 0.1100                                         min =          3
     between = 0.0001                                         avg =        3.0
     overall = 0.0000                                         max =          3

                                                F(8,274)          =       4.13
corr(u_i, Xb)  = -0.8966                        Prob > F          =     0.0001

                                   (Std. Err. adjusted for 275 clusters in ID)
------------------------------------------------------------------------------
             |               Robust
       ROA_w |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       ENV_w |  -.0120233   .0087011    -1.38   0.168    -.0291529    .0051062
    ENV_lag1 |  -.0054445    .007026    -0.77   0.439    -.0192762    .0083872
    ENV_lag2 |  -.0146844    .008402    -1.75   0.082     -.031225    .0018563
    ENV_lag3 |   .0084578    .005486     1.54   0.124    -.0023422    .0192578
    ENV_lag4 |   .0221236   .0079207     2.79   0.006     .0065305    .0377167
    ENV_lag5 |  -.0273902   .0089742    -3.05   0.002    -.0450572   -.0097231
       RDI_w |  -125.8473   36.56646    -3.44   0.001    -197.8342   -53.86042
      SIZE_w |  -.0000966   .0000443    -2.18   0.030    -.0001837   -9.40e-06
       _cons |    13.8214   2.315614     5.97   0.000     9.262746    18.38006
-------------+----------------------------------------------------------------
     sigma_u |  6.0589826
     sigma_e |  1.5639711
         rho |  .93753381   (fraction of variance due to u_i)
------------------------------------------------------------------------------

The results are actually similar to those in the literature that I'm using. I'm just not sure whether I can ignore the multicollinearity problem.

I don't think the lags can be eliminated, because one of the main objectives is to see whether time lags matter here.
Since you mentioned that high multicollinearity is to be expected when lags are included, does that mean I can just leave it as it is?


Oscar Ozfidan

Join Date: Sep 2018

Posts: 257
#5

28 Apr 2021, 22:16

You can read the following to see if you should or not.
https://statisticalhorizons.com/multicollinearity
Clarine Caine

Join Date: Apr 2021

Posts: 6
#6

28 Apr 2021, 22:35

Thanks Oscar!
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2121
#7

29 Apr 2021, 07:02

I bet you can estimate the long run effect -- that is, the sum of the coefficients -- much more precisely. Our theories are often better about long-run relationships, anyway. It's always difficult to estimate the dynamic effects, but more data helps.
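
A minimal sketch of estimating that sum, assuming the variable names and the fixed-effects specification posted earlier in the thread. lincom reports the estimated sum of the coefficients along with its standard error and confidence interval:

```stata
* Fixed-effects regression as posted earlier in the thread
xtreg ROA_w ENV_w ENV_lag1 ENV_lag2 ENV_lag3 ENV_lag4 ENV_lag5 ///
    RDI_w SIZE_w, fe vce(cluster ID)

* Long-run effect: the sum of the coefficients on ENV_w and its five lags
lincom ENV_w + ENV_lag1 + ENV_lag2 + ENV_lag3 + ENV_lag4 + ENV_lag5
```

Because the individual lag coefficients are estimated with strongly correlated errors, the standard error of the sum can be much smaller than the individual standard errors would suggest.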

I'm not a fan of even looking at the VIFs in these cases. I know I have a lot of multicollinearity. But I will say that the situation is almost certainly worse with fixed effects, because much of the variation in the explanatory variables is removed. You can see this by centering the data around the firm-specific time averages, running pooled OLS, and then obtaining the VIFs. Like I said, I'm not sure what you learn will help.
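
The centering exercise could be sketched roughly as follows. This assumes the variable names from the thread and the panel identifier ID; the nmiss, mean_* and dm_* variables are hypothetical names introduced only for this illustration.

```stata
local vars ROA_w ENV_w ENV_lag1 ENV_lag2 ENV_lag3 ENV_lag4 ENV_lag5 RDI_w SIZE_w

* Flag observations with a missing value in any model variable, so the
* firm averages are computed on the same observations used in estimation
egen nmiss = rowmiss(`vars')

* Subtract each firm's time average from every variable
foreach v of local vars {
    bysort ID: egen double mean_`v' = mean(`v') if nmiss == 0
    generate double dm_`v' = `v' - mean_`v' if nmiss == 0
}

* Pooled OLS on the firm-demeaned data, then the VIFs
regress dm_ROA_w dm_ENV_w dm_ENV_lag1 dm_ENV_lag2 dm_ENV_lag3 ///
    dm_ENV_lag4 dm_ENV_lag5 dm_RDI_w dm_SIZE_w
estat vif
```

The coefficients from this regression should match the fixed-effects estimates, but its standard errors do not account for the degrees of freedom used by the firm means, so it is only useful here for looking at the VIFs.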