  • R2 and Collinearity increase drastically after including all explanatory variables and year fixed-effects (i.Year)

    Hi,
    I have a short question. For my bachelor thesis I am investigating the impact of Covid-19 on schoolchildren's average academic performance. After I included all explanatory variables (CovidDummy, Municipalityincome for compulsory schools, and the control for the share of higher-educated parents), my R2 increased drastically, and so did the collinearity. When I ran the model with only the CovidDummy, this was not an issue. Can someone explain why this happens, and whether it is OK or should be corrected somehow?

    Thanks a lot in advance
    Best Jasmin
    Attached Files

  • #2
    Jasmin:
    1) an overall R-sq that increases as you add predictors is not surprising at all. In addition, since you went -fe-, you should consider the within R-sq;
    2) you do not provide details about increased collinearity;
    3) as usual, you should check whether your regression is correctly specified.
    Kind regards,
    Carlo
    (StataNow 18.5)
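
    A minimal sketch of point 1), assuming the -nlswork- toy dataset from the Stata manuals as a stand-in (Jasmin's data are not shown): after -xtreg, fe- the three R-sq values are stored in e(r2_w), e(r2_b) and e(r2_o), so the within one can be reported directly.

    ```stata
    * Toy panel dataset from the Stata manuals, already xtset (idcode year)
    use "https://www.stata-press.com/data/r17/nlswork.dta", clear

    * Fixed-effects (within) regression
    xtreg ln_wage c.age##c.age, fe

    * R-sq values stored by -xtreg, fe-
    display "Within  R-sq = " %6.4f e(r2_w)
    display "Between R-sq = " %6.4f e(r2_b)
    display "Overall R-sq = " %6.4f e(r2_o)
    ```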



    • #3
      Hello Carlo, thanks a lot for replying so quickly. It helped me a lot.

      Regarding point 1: thanks for this information. Since I am using fe, I will present the within R2.
      Regarding point 2: I think I understand the reason. Before, I had not included the "share of higher educated parents", and therefore my correlation value was very low (0.06), which indicates that this variable is a useful contribution in explaining the dependent variable. Is this right?
      Regarding point 3: can you please explain what you mean by a correctly specified regression model? This is my first research project, so I am not that familiar with regression models, but I am trying my best.
      • I am now a bit concerned that "Municipality income for compulsory schools" is insignificant. Do you know a possible reason? Does this just mean that this variable has no impact on the change in student academic performance during Covid-19?
      • "2020.Year1 omitted because of collinearity": does this cause any problems in my regression model?

      Wish you a nice afternoon Carlo.

      Best regards
      Jasmin





      • #4
        Jasmin:
        Point 2): I fail to follow your statement about correlation;
        Point 3):
        a) is -Municipality income for compulsory schools- continuous or categorical?
        b) the omission you complain about may be due to dummy trap (https://en.wikipedia.org/wiki/Dummy_variable_(statistics)) avoidance or to perfect collinearity with the fixed effects.
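
        A sketch of how perfect collinearity with year dummies can arise, assuming (the data are not shown) that the Covid dummy depends on the year only, e.g. it equals 1 from 2020 onward. The variable -post- and the cutoff year 80 below are hypothetical stand-ins built on the -nlswork- toy dataset:

        ```stata
        use "https://www.stata-press.com/data/r17/nlswork.dta", clear

        * Hypothetical period indicator: 1 from year 80 onward, 0 before
        generate byte post = (year >= 80) if !missing(year)

        * post is an exact linear combination of the year dummies,
        * so regressing it on them gives a perfect fit (R-sq = 1)
        regress post i.year

        * With both post and i.year in the model, Stata has to omit one term
        * and reports "omitted because of collinearity"
        xtreg ln_wage tenure i.post i.year, fe
        ```

        If this is what happens in the thesis model, the omission is Stata's way of making the model estimable, but the coefficient on the period dummy can then no longer be separated from the year effects: its value depends on which year dummy is dropped.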

        As far as the misspecification issue is concerned, the following toy example applies to a panel dataset the very same approach detailed (for cross-sectional datasets) under the -linktest- entry in the Stata .pdf manual:
        Code:
        . use "https://www.stata-press.com/data/r17/nlswork.dta"
        (National Longitudinal Survey of Young Women, 14-24 years old in 1968)
        
        . xtreg ln_wage c.age##c.age, fe
        
        Fixed-effects (within) regression               Number of obs     =     28,510
        Group variable: idcode                          Number of groups  =      4,710
        
        R-squared:                                      Obs per group:
             Within  = 0.1087                                         min =          1
             Between = 0.1006                                         avg =        6.1
             Overall = 0.0865                                         max =         15
        
                                                        F(2,23798)        =    1451.88
        corr(u_i, Xb) = 0.0440                          Prob > F          =     0.0000
        
        ------------------------------------------------------------------------------
             ln_wage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
        -------------+----------------------------------------------------------------
                 age |   .0539076   .0028078    19.20   0.000     .0484041    .0594112
                     |
         c.age#c.age |  -.0005973   .0000465   -12.84   0.000    -.0006885   -.0005061
                     |
               _cons |    .639913   .0408906    15.65   0.000     .5597649    .7200611
        -------------+----------------------------------------------------------------
             sigma_u |   .4039153
             sigma_e |  .30245467
                 rho |  .64073314   (fraction of variance due to u_i)
        ------------------------------------------------------------------------------
        F test that all u_i=0: F(4709, 23798) = 8.74                 Prob > F = 0.0000
        
        . predict fitted, xb
        (24 missing values generated)
        
        . g sq_fitted=fitted^2
        (24 missing values generated)
        
        . xtreg ln_wage fitted sq_fitted , fe
        
        Fixed-effects (within) regression               Number of obs     =     28,510
        Group variable: idcode                          Number of groups  =      4,710
        
        R-squared:                                      Obs per group:
             Within  = 0.1092                                         min =          1
             Between = 0.1033                                         avg =        6.1
             Overall = 0.0881                                         max =         15
        
                                                        F(2,23798)        =    1457.96
        corr(u_i, Xb) = 0.0467                          Prob > F          =     0.0000
        
        ------------------------------------------------------------------------------
             ln_wage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
        -------------+----------------------------------------------------------------
              fitted |   2.569185    .476861     5.39   0.000     1.634507    3.503863
           sq_fitted |    -.47432   .1440324    -3.29   0.001    -.7566326   -.1920074
               _cons |  -1.290258   .3930351    -3.28   0.001    -2.060631   -.5198837
        -------------+----------------------------------------------------------------
             sigma_u |    .403403
             sigma_e |  .30238578
                 rho |  .64025357   (fraction of variance due to u_i)
        ------------------------------------------------------------------------------
        F test that all u_i=0: F(4709, 23798) = 8.72                 Prob > F = 0.0000
        
        . test sq_fitted=0
        
         ( 1)  sq_fitted = 0
        
               F(  1, 23798) =   10.84
                    Prob > F =    0.0010
        
        .
        As the coefficient on -sq_fitted- is significantly different from 0, there's evidence of misspecification.
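
        In equation form, the check above can be sketched as follows (the notation is an assumption, not taken from the manual). With $\hat{y}_{it}$ the linear prediction from the first regression, the second regression is

        ```latex
        \ln(\text{wage}_{it}) = \alpha_i + \gamma_1 \hat{y}_{it} + \gamma_2 \hat{y}_{it}^{2} + \varepsilon_{it},
        \qquad H_0\colon \gamma_2 = 0 .
        ```

        With a single restriction, the F statistic is the square of the t statistic on the squared term: here $(-3.29)^2 \approx 10.8$, which matches F = 10.84 up to rounding.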
        Kind regards,
        Carlo
        (StataNow 18.5)



        • #5
          Hi Carlo,

          - The municipality income variable is a continuous variable. Does this mean I cannot include it in my fixed-effects model?
          If I cannot, can I instead create dummies so that I can include this variable in my fixed-effects model?

          - Thanks for the information. I will run a test to see whether I am dealing with misspecification.

          - If the omission is due to the dummy trap, can I leave the regression as it is, or do I need to change something?

          Best regards

          Jasmin



          • #6
            Stata deals with the dummy variable trap by excluding one of the levels of the dummy variable as a reference group.

            There's no reason why continuous variables can't be included in FE models.
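
            A small sketch of both points, using the -nlswork- toy dataset as a stand-in (Jasmin's data are not shown): -tenure- is continuous and is estimated without any problem, while -i.year- enters as a set of dummies from which Stata leaves out the base year as the reference category.

            ```stata
            use "https://www.stata-press.com/data/r17/nlswork.dta", clear

            * tenure: continuous regressor; i.year: year dummies (base year left out)
            xtreg ln_wage tenure i.year, fe

            * The continuous regressor has an ordinary slope coefficient
            display "Coefficient on tenure = " _b[tenure]
            ```

            What a FE model cannot estimate is a regressor that never changes within a panel unit; a continuous variable that varies over time is fine.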


            More to the point, I'm confused about your original regression syntax. You wrote
            Code:
            reg y x if treated > 0 & treated != 1
            If your treatment is an indicator variable (I don't know since I can't see an example of your data), maybe this has something to do with the issue? Typically, treatments (that aren't phased in, anyway) are just 0/1 variables.
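
            To illustrate the concern, a sketch with made-up data, assuming -treated- is a 0/1 indicator with possibly some missing values: because Stata treats missing values as larger than any number, the condition -treated > 0 & treated != 1- then keeps only the observations where -treated- is missing.

            ```stata
            clear
            set obs 6

            * Made-up 0/1 indicator: 1 0 1 0 1 0 ...
            generate byte treated = mod(_n, 2)
            * ... with the last observation set to missing
            replace treated = . in 6

            * Only the missing observation satisfies the condition
            count if treated > 0 & treated != 1

            * Compare with the usual way of selecting treated units
            count if treated == 1
            ```

            If -treated- instead takes more than two values (e.g. several treatment groups), the condition does select real observations, which is why the coding of the variable matters here.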



            • #7
              Thanks for the reply Jared, this is good to know. The treatment should exclude some specific observations.



              • #8
                What does treated =1 represent here, then, and why would we want to exclude it? Jasmin Music
