R2 and collinearity increase drastically after including all explanatory variables and year fixed effects (i.Year)

Jasmin Music

Join Date: Dec 2021

Posts: 18
#1


29 Dec 2021, 03:55

Hi,
I have a short question. I am investigating the impact of COVID-19 on schoolchildren's average academic performance for a bachelor thesis. After I included all explanatory variables (CovidDummy, Municipalityincome for compulsory schools, and a control for the share of higher-educated parents), my R2 increased drastically, and so did the collinearity. When I ran the model with only the CovidDummy, this was not an issue. Can someone explain why this happens, and whether it is OK or should be corrected somehow?

Thanks a lot in advance
Best Jasmin
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#2

29 Dec 2021, 04:00

Jasmin:
1) an increase in the overall R-sq when adding more predictors is not surprising at all. In addition, since you went -fe-, you should consider the within R-sq;
2) you do not provide details about increased collinearity;
3) as usual, you should check whether your regression is correctly specified.
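
A minimal sketch of where the three R-sq values can be found after -xtreg, fe-. It assumes the nlswork teaching dataset that ships with Stata; the dataset and the regressors are illustrative stand-ins, not the thesis data:

```stata
* Illustrative only: nlswork stands in for the thesis data
webuse nlswork, clear
xtset idcode year

* Fixed-effects regression with two predictors
xtreg ln_wage age tenure, fe

* -xtreg, fe- stores the within, between and overall R-sq separately
display "within  R-sq = " %6.4f e(r2_w)
display "between R-sq = " %6.4f e(r2_b)
display "overall R-sq = " %6.4f e(r2_o)
```

The within R-sq describes the fit to the variation over time inside each panel unit, which is the variation a fixed-effects estimator actually uses.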

Kind regards,
Carlo
(Stata 19.0)
Jasmin Music

Join Date: Dec 2021

Posts: 18
#3

29 Dec 2021, 07:09

Hello Carlo, thanks a lot for replying so quickly. It helped me a lot.

Regarding point 1: thanks for this information. I will then present the within R2 in my case, since I am using fe.
Regarding point 2: I think I understand the reason. Before, I had not included the "share of higher educated parents", and therefore my correlation value was very low (0.06), which indicates that this variable was a useful contribution in explaining the dependent variable. Is this right?
Regarding point 3: can you please explain what you mean by a correctly specified regression model? This is my first research project, so I am not that familiar with regression models, but I am trying my best.
I am now only a bit concerned that "Municipality income for compulsory schools" is insignificant. Do you know a possible reason? Does this just mean that this variable has no impact on the change in student academic performance during Covid-19?

Does the message "2020.Year1 omitted because of collinearity" cause any problems in my regression model?

Wish you a nice afternoon Carlo.

Best regards
Jasmin

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17707
#4

29 Dec 2021, 08:22

Jasmin:
Point 2): I fail to follow your statement about correlation;
Point 3):
a) is -Municipality income for compulsory schools- continuous or categorical?
b) the omission you complain about may be due to dummy trap (https://en.wikipedia.org/wiki/Dummy_variable_(statistics)) avoidance or to perfect collinearity with the fixed effect.

As far as the misspecification issue is concerned, the following toy example applies to a panel dataset the very same approach detailed under the -linktest- entry (which covers cross-sectional datasets) in the Stata .pdf manual:

Code:

. use "https://www.stata-press.com/data/r17/nlswork.dta"
(National Longitudinal Survey of Young Women, 14-24 years old in 1968)

. xtreg ln_wage c.age##c.age, fe

Fixed-effects (within) regression               Number of obs     =     28,510
Group variable: idcode                          Number of groups  =      4,710

R-squared:                                      Obs per group:
     Within  = 0.1087                                         min =          1
     Between = 0.1006                                         avg =        6.1
     Overall = 0.0865                                         max =         15

                                                F(2,23798)        =    1451.88
corr(u_i, Xb) = 0.0440                          Prob > F          =     0.0000

------------------------------------------------------------------------------
     ln_wage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   .0539076   .0028078    19.20   0.000     .0484041    .0594112
             |
 c.age#c.age |  -.0005973   .0000465   -12.84   0.000    -.0006885   -.0005061
             |
       _cons |    .639913   .0408906    15.65   0.000     .5597649    .7200611
-------------+----------------------------------------------------------------
     sigma_u |   .4039153
     sigma_e |  .30245467
         rho |  .64073314   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(4709, 23798) = 8.74                 Prob > F = 0.0000

. predict fitted, xb
(24 missing values generated)

. g sq_fitted=fitted^2
(24 missing values generated)

. xtreg ln_wage fitted sq_fitted , fe

Fixed-effects (within) regression               Number of obs     =     28,510
Group variable: idcode                          Number of groups  =      4,710

R-squared:                                      Obs per group:
     Within  = 0.1092                                         min =          1
     Between = 0.1033                                         avg =        6.1
     Overall = 0.0881                                         max =         15

                                                F(2,23798)        =    1457.96
corr(u_i, Xb) = 0.0467                          Prob > F          =     0.0000

------------------------------------------------------------------------------
     ln_wage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      fitted |   2.569185    .476861     5.39   0.000     1.634507    3.503863
   sq_fitted |    -.47432   .1440324    -3.29   0.001    -.7566326   -.1920074
       _cons |  -1.290258   .3930351    -3.28   0.001    -2.060631   -.5198837
-------------+----------------------------------------------------------------
     sigma_u |    .403403
     sigma_e |  .30238578
         rho |  .64025357   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(4709, 23798) = 8.72                 Prob > F = 0.0000

. test sq_fitted=0

 ( 1)  sq_fitted = 0

       F(  1, 23798) =   10.84
            Prob > F =    0.0010

.

As the coefficient of -sq_fitted- is significantly different from 0, there's evidence of misspecification.

Kind regards,
Carlo
(Stata 19.0)


Jasmin Music

Join Date: Dec 2021

Posts: 18
#5

29 Dec 2021, 14:00

Hi Carlo,

- The municipality income variable is a continuous variable. Does this mean I cannot include it in my fixed-effects model?
If I cannot, can I instead create dummies so that I can include this variable in my fixed-effects model?

- Thanks for the information. I will run a test to see whether I am dealing with misspecification.

- If the omission is due to the dummy trap, can I leave the regression as it is, or do I need to change something?

Best regards

Jasmin
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#6

29 Dec 2021, 14:17

Stata deals with the dummy variable trap by excluding one of the levels of the dummy variable as a reference group.

There's no reason why continuous variables can't be included in FE models.
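
To illustrate both points, here is a sketch on the nlswork teaching dataset, which is an assumption standing in for the thesis data. The variable -post- is a hypothetical period indicator playing the role of a Covid dummy. Because -post- is fully determined by the year, it is an exact linear combination of the year indicators, so Stata should drop one additional term with an "omitted because of collinearity" note, while the continuous regressor -tenure- stays in the model:

```stata
* Illustrative only: nlswork stands in for the thesis data
webuse nlswork, clear
xtset idcode year

* Hypothetical period indicator: 1 for survey years 1980 onward
* (year is coded with two digits, 68 to 88, in this dataset)
generate byte post = (year >= 80) if !missing(year)

* i.year already leaves out one base year to avoid the dummy trap;
* post is collinear with the remaining year indicators, so one more term is dropped
xtreg ln_wage tenure post i.year, fe
```

The model is still estimated after the omission, but the coefficient on -post- then has to be read together with the year indicators rather than as a stand-alone period effect.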

More to the point, I'm confused about your original regression syntax. You wrote

Code:

reg y x if treated > 0 & treated != 1

If your treatment is an indicator variable (I don't know since I can't see an example of your data), maybe this has something to do with the issue? Typically treatments (that aren't phased-in anyways) are just 0 1 variables.
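
A small sketch of why this -if- condition looks odd for a 0/1 indicator; the five-row dataset is made up for illustration. No observation with treated equal to 0 or 1 can satisfy it. The only rows it can select are those where treated is missing, because Stata treats numeric missing values as larger than any number:

```stata
* Made-up data: four valid 0/1 values and one missing value
clear
input byte treated
0
1
1
0
.
end

* Rows with treated > 0 that are not equal to 1: only the missing value qualifies
count if treated > 0 & treated != 1
list if treated > 0 & treated != 1

* Excluding missing values explicitly leaves nothing for a 0/1 variable
count if treated > 0 & treated != 1 & !missing(treated)
```

If the intent is to drop particular observations, a separate, explicitly named exclusion variable would make the sample restriction easier to check.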
Jasmin Music

Join Date: Dec 2021

Posts: 18
#7

30 Dec 2021, 03:39

Thanks for the reply Jared, this is good to know. The treatment should exclude some specific observations.
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#8

30 Dec 2021, 05:13

Jasmin Music, what does treated = 1 represent here, then, and why would we want to exclude it?
