  • Adjusted R-squared still high after deleting control variables

    Hello Statalist!


    I am currently running a regression and want to check whether my variables explain much of the variation in the dependent variable.

    To do this I ran the regression once with control variables and once without. The adjusted R-squared is less than one percentage point higher when I include the control variables (0.8151 versus 0.8114). All of them are variables identified as important by previous literature.

    Now I am hesitant to report this. Does this mean something is wrong with my model?

    I also tried dropping my main variables and including only the controls, and the adjusted R-squared is still about 81 percent.

    I included time and individual (firm) fixed effects and clustered the standard errors at the firm level.

    Cross-posted here:
    https://stats.stackexchange.com/ques...trol-variables.

    Code:
    . reghdfe Y1 x1 x2 x3 c1 c2 c3 c4 c5 c6 c7 c8 c9, absorb(FIRM Year) cluster(FIRM)
    (dropped 8 singleton observations)
    (MWFE estimator converged in 6 iterations)
    
    HDFE Linear regression                            Number of obs   =        967
    Absorbing 2 HDFE groups                           F(  12,    215) =       4.20
    Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                      R-squared       =     0.8595
                                                      Adj R-squared   =     0.8151
                                                      Within R-sq.    =     0.0566
    Number of clusters (FIRM)    =        216         Root MSE        =     0.1304
    
                                     (Std. err. adjusted for 216 clusters in FIRM)
    ------------------------------------------------------------------------------
                 |               Robust
              Y1 | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
              x1 |   .0274918   .0196236     1.40   0.163    -.0111875    .0661712
              x2 |  -.0207942   .0100225    -2.07   0.039    -.0405492   -.0010391
              x3 |  -.0019367    .000597    -3.24   0.001    -.0031134     -.00076
              c1 |   .1019985    .058054     1.76   0.080    -.0124293    .2164264
              c2 |   -.016942   .0090505    -1.87   0.063     -.034781    .0008971
              c3 |  -.0003009   .0058301    -0.05   0.959    -.0117923    .0111905
              c4 |   .0068474   .0265863     0.26   0.797    -.0455557    .0592505
              c5 |   .0363782   .0308575     1.18   0.240    -.0244438    .0972001
              c6 |  -.0004864   .0009786    -0.50   0.620    -.0024153    .0014426
              c7 |  -.1023495    .089382    -1.15   0.253    -.2785266    .0738277
              c8 |   .0001882   .0000738     2.55   0.012     .0000426    .0003337
              c9 |   .0066574   .0023477     2.84   0.005     .0020299    .0112849
           _cons |  -3.505176   1.365308    -2.57   0.011     -6.19628   -.8140728
    ------------------------------------------------------------------------------
    
    Absorbed degrees of freedom:
    -----------------------------------------------------+
     Absorbed FE | Categories  - Redundant  = Num. Coefs |
    -------------+---------------------------------------|
            FIRM |       216         216           0    *|
            Year |         5           0           5     |
    -----------------------------------------------------+
    * = FE nested within cluster; treated as redundant for DoF computation
    Similarly, once without the controls:

    Code:
    . reghdfe Y1 x1 x2 x3, absorb(FIRM Year) cluster(FIRM)
    (dropped 8 singleton observations)
    (MWFE estimator converged in 6 iterations)
    
    HDFE Linear regression                            Number of obs   =        967
    Absorbing 2 HDFE groups                           F(   3,    215) =       5.28
    Statistics robust to heteroskedasticity           Prob > F        =     0.0016
                                                      R-squared       =     0.8549
                                                      Adj R-squared   =     0.8114
                                                      Within R-sq.    =     0.0262
    Number of clusters (FIRM)    =        216         Root MSE        =     0.1317
    
                                     (Std. err. adjusted for 216 clusters in FIRM)
    ------------------------------------------------------------------------------
                 |               Robust
              Y1 | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
              x1 |   .0207614   .0198057     1.05   0.296    -.0182767    .0597995
              x2 |  -.0213535   .0101183    -2.11   0.036    -.0412973   -.0014098
              x3 |  -.0019855   .0005953    -3.34   0.001    -.0031589   -.0008122
           _cons |  -.9862518   .0393388   -25.07   0.000    -1.063791   -.9087126
    ------------------------------------------------------------------------------
    
    Absorbed degrees of freedom:
    -----------------------------------------------------+
     Absorbed FE | Categories  - Redundant  = Num. Coefs |
    -------------+---------------------------------------|
            FIRM |       216         216           0    *|
            Year |         5           0           5     |
    -----------------------------------------------------+
    * = FE nested within cluster; treated as redundant for DoF computation
    And once without the main independent variables (x1, x2, x3), keeping only the controls:

    Code:
    . reghdfe Y1 c1 c2 c3 c4 c5 c6 c7 c8 c9, absorb(FIRM Year) cluster(FIRM)
    (dropped 8 singleton observations)
    (MWFE estimator converged in 6 iterations)
    
    HDFE Linear regression                            Number of obs   =        967
    Absorbing 2 HDFE groups                           F(   9,    215) =       2.89
    Statistics robust to heteroskedasticity           Prob > F        =     0.0030
                                                      R-squared       =     0.8557
                                                      Adj R-squared   =     0.8109
                                                      Within R-sq.    =     0.0314
    Number of clusters (FIRM)    =        216         Root MSE        =     0.1319
    
                                     (Std. err. adjusted for 216 clusters in FIRM)
    ------------------------------------------------------------------------------
                 |               Robust
              Y1 | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
              c1 |   .1182841   .0640627     1.85   0.066    -.0079873    .2445556
              c2 |  -.0174848   .0088881    -1.97   0.050    -.0350038    .0000342
              c3 |   -.000514   .0058961    -0.09   0.931    -.0121356    .0111076
              c4 |  -.0000784   .0279881    -0.00   0.998    -.0552447    .0550879
              c5 |   .0416486   .0315728     1.32   0.189    -.0205832    .1038804
              c6 |  -.0008959    .000984    -0.91   0.364    -.0028354    .0010436
              c7 |    -.09267   .0946564    -0.98   0.329    -.2792434    .0939033
              c8 |   .0002231   .0000741     3.01   0.003      .000077    .0003691
              c9 |   .0060863    .002376     2.56   0.011      .001403    .0107696
           _cons |  -3.877952   1.484846    -2.61   0.010    -6.804671   -.9512319
    ------------------------------------------------------------------------------
    
    Absorbed degrees of freedom:
    -----------------------------------------------------+
     Absorbed FE | Categories  - Redundant  = Num. Coefs |
    -------------+---------------------------------------|
            FIRM |       216         216           0    *|
            Year |         5           0           5     |
    -----------------------------------------------------+
    * = FE nested within cluster; treated as redundant for DoF computation
    Last edited by Luca Haseney; 11 Jan 2023, 07:37.

  • #2
    Luca:
    why not post what you typed and what Stata gave you back (as per the FAQ)? Thanks.
    Kind regards,
    Carlo
    (Stata 19.0)


    • #3
      Dear Mr. Lazzaro, sure. My bad!
      Edited in the first post.


      • #4
        Hi Luca
        So, the reason you have a high R2 is that it also accounts for the absorbed effects. They alone probably explain 80% of the variation.
        What is more relevant here is your within R2, which is only 5.6% using all controls, and as low as 2.6% when looking at the X's only.
        HTH
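
        A rough way to check this, as a sketch only: it assumes the variable names from #1 (Y1, FIRM, Year) and that -reghdfe- is installed. Fitting the model with the absorbed effects alone shows how much of the overall R-squared they account for; the stored results e(r2) and e(r2_within) then let you compare the overall and within figures for the full model. The fixed-effects-only fit may use a slightly different sample if any regressors have missing values.

        ```stata
        * Fixed effects only: the R-squared here comes entirely from
        * the absorbed firm and year effects
        reghdfe Y1, absorb(FIRM Year)
        display "R2 from fixed effects alone: " e(r2)

        * Full model: compare the overall R2 with the within R2
        reghdfe Y1 x1 x2 x3 c1 c2 c3 c4 c5 c6 c7 c8 c9, absorb(FIRM Year) cluster(FIRM)
        display "Overall R2: " e(r2) "   Within R2: " e(r2_within)
        ```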


        • #5
          Luca:
          1) I share Fernando's concern about the very low within R-sq, which calls for double-checking the functional form of the regressand;
          2) just out of curiosity: what is the gain in going -reghdfe- rather than -xtreg,fe- with your panel data?
          3) please call me Carlo, as all on (and many more off) this forum do. Thanks.
          Kind regards,
          Carlo
          (Stata 19.0)


          • #6
            Originally posted by FernandoRios
            Hi Luca
            So, the reason you have a high R2 is that it also accounts for the absorbed effects. They alone probably explain 80% of the variation.
            What is more relevant here is your within R2, which is only 5.6% using all controls, and as low as 2.6% when looking at the X's only.
            HTH

            Dear Fernando, yes, this helps. Thank you very much!

            Dear Carlo,

            1) Is there a systematic way to check the functional form?
            I followed the previous literature and applied natural logarithms to the respective variables. The Y1 measure is logarithmic, as are c4 and c5. The other variables are binary or decimals (i.e. a ratio of some kind).

            2) Because I explicitly wanted to include time and firm fixed effects, to capture unobserved firm characteristics and time effects that affect all firms equally.

            3) Sure!


            • #7
              Luca:
              1) yes (drawing heavily on -linktest-). The toy example uses -re- but works with -fe- as well:
              Code:
              . use "https://www.stata-press.com/data/r17/nlswork.dta"
              (National Longitudinal Survey of Young Women, 14-24 years old in 1968)
              
              . xtreg ln_wage i.race i.nev_mar, re vce(cluster idcode)
              
              Random-effects GLS regression                   Number of obs     =     28,518
              Group variable: idcode                          Number of groups  =      4,711
              
              R-squared:                                      Obs per group:
                   Within  = 0.0263                                         min =          1
                   Between = 0.0121                                         avg =        6.1
                   Overall = 0.0145                                         max =         15
              
                                                              Wald chi2(3)      =     429.57
              corr(u_i, X) = 0 (assumed)                      Prob > chi2       =     0.0000
              
                                           (Std. err. adjusted for 4,711 clusters in idcode)
              ------------------------------------------------------------------------------
                           |               Robust
                   ln_wage | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
              -------------+----------------------------------------------------------------
                      race |
                    Black  |   -.110084     .01332    -8.26   0.000    -.1361908   -.0839772
                    Other  |   .1165283   .0666152     1.75   0.080     -.014035    .2470917
                           |
                 1.nev_mar |  -.1611142   .0087208   -18.47   0.000    -.1782066   -.1440217
                     _cons |    1.72454   .0074549   231.33   0.000     1.709929    1.739152
              -------------+----------------------------------------------------------------
                   sigma_u |  .38311279
                   sigma_e |   .3159974
                       rho |  .59512448   (fraction of variance due to u_i)
              ------------------------------------------------------------------------------
              
              . predict fitted, xb
              
              
              . g sq_fitted=fitted^2
              
              
              . xtreg ln_wage fitted sq_fitted , re vce(cluster idcode)
              
              Random-effects GLS regression                   Number of obs     =     28,518
              Group variable: idcode                          Number of groups  =      4,711
              
              R-squared:                                      Obs per group:
                   Within  = 0.0263                                         min =          1
                   Between = 0.0120                                         avg =        6.1
                   Overall = 0.0145                                         max =         15
              
                                                              Wald chi2(2)      =     421.17
              corr(u_i, X) = 0 (assumed)                      Prob > chi2       =     0.0000
              
                                           (Std. err. adjusted for 4,711 clusters in idcode)
              ------------------------------------------------------------------------------
                           |               Robust
                   ln_wage | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
              -------------+----------------------------------------------------------------
                    fitted |   1.624269   1.588631     1.02   0.307     -1.48939    4.737928
                 sq_fitted |  -.1929673   .4921703    -0.39   0.695    -1.157603    .7716688
                     _cons |   -.503038   1.278528    -0.39   0.694    -3.008906     2.00283
              -------------+----------------------------------------------------------------
                   sigma_u |   .3847733
                   sigma_e |  .31599437
                       rho |  .59721155   (fraction of variance due to u_i)
              ------------------------------------------------------------------------------
              
              .
              As -sq_fitted- does not reach statistical significance, there's no evidence of model misspecification;
              2) why not go with:
              Code:
              xtset firm year
              xtreg depvar indepvars i.year, fe vce(cluster firm)
              Kind regards,
              Carlo
              (Stata 19.0)


              • #8
                Dear Carlo, thank you so much for your suggestions. I really appreciate it.

                I applied the code for my sample:

                Code:
                xtreg  Y1 x1 x2 x3 c1 c2 c3 c4 c5 c6 c7 c8 c9 i.Year, fe vce(cluster FIRM)
                predict fitted, xb
                g sq_fitted=fitted^2
                xtreg Y1 fitted sq_fitted, fe vce(cluster FIRM)
                
                Fixed-effects (within) regression               Number of obs     =        975
                Group variable: FIRM                            Number of groups  =        224
                
                R-squared:                                      Obs per group:
                     Within  = 0.4556                                         min =          1
                     Between = 0.0501                                         avg =        4.4
                     Overall = 0.0236                                         max =          5
                
                                                                F(2,223)          =     118.62
                corr(u_i, Xb) = -0.9464                         Prob > F          =     0.0000
                
                                                 (Std. err. adjusted for 224 clusters in FIRM)
                ------------------------------------------------------------------------------
                             |               Robust
                          Y1 | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
                -------------+----------------------------------------------------------------
                      fitted |    1.09574   .1140394     9.61   0.000     .8710069    1.320472
                   sq_fitted |   .0454518   .0392435     1.16   0.248    -.0318836    .1227873
                       _cons |   .0205962    .078555     0.26   0.793    -.1342089    .1754013
                -------------+----------------------------------------------------------------
                     sigma_u |  .91800801
                     sigma_e |  .12883184
                         rho |  .98068551   (fraction of variance due to u_i)
                ------------------------------------------------------------------------------
                It points to no model misspecification.

                I also referred to Mr. Wooldridge's post to check for non-linearity:

                Code:
                xtreg Y1 x1 x2 x3 c1 c2 c3 c4 c5 c6 c7 c8 c9 i.Year, fe vce(cluster FIRM)
                predict xbhat, xb
                gen xbhatsq = xbhat^2
                gen xbhatcu = xbhat^2
                xtreg Y1 x1 x2 x3 c1 c2 c3 c4 c5 c6 c7 c8 c9 i.Year xbhatsq xbhatcu, fe vce(cluster FIRM)
                test xbhatsq xbhatcu
                
                Fixed-effects (within) regression               Number of obs     =        975
                Group variable: FIRM                            Number of groups  =        224
                
                R-squared:                                      Obs per group:
                     Within  = 0.4559                                         min =          1
                     Between = 0.0511                                         avg =        4.4
                     Overall = 0.0218                                         max =          5
                
                                                                F(17,223)         =      16.97
                corr(u_i, Xb) = -0.9299                         Prob > F          =     0.0000
                
                                                 (Std. err. adjusted for 224 clusters in FIRM)
                ------------------------------------------------------------------------------
                             |               Robust
                          Y1 | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
                -------------+----------------------------------------------------------------
                          x1 |   .0300244   .0199219     1.51   0.133    -.0092348    .0692836
                          x2 |  -.0224521    .010106    -2.22   0.027    -.0423676   -.0025365
                          x3 |  -.0021353   .0006035    -3.54   0.000    -.0033246    -.000946
                          c1 |   .0982184   .0552264     1.78   0.077     -.010614    .2070508
                          c2 |   -.017616   .0089341    -1.97   0.050    -.0352221   -9.93e-06
                          c3 |  -.0003486   .0058461    -0.06   0.952    -.0118694    .0111721
                          c4 |   .0147686   .0276555     0.53   0.594    -.0397309    .0692681
                          c5 |   .0381699   .0311407     1.23   0.222    -.0231978    .0995376
                          c6 |  -.0003709   .0009665    -0.38   0.702    -.0022755    .0015338
                          c7 |  -.1128985   .0909374    -1.24   0.216    -.2921051    .0663081
                          c8 |   .0001861   .0000739     2.52   0.013     .0000404    .0003317
                          c9 |   .0077529   .0025723     3.01   0.003     .0026837     .012822
                             |
                        Year |
                       2018  |  -.0554002   .0574869    -0.96   0.336    -.1686873    .0578868
                       2019  |  -.0811843    .111925    -0.73   0.469    -.3017502    .1393817
                       2020  |  -.0204323   .1680655    -0.12   0.903     -.351632    .3107675
                       2021  |  -.0867472    .224055    -0.39   0.699    -.5282832    .3547889
                             |
                     xbhatsq |   .0498728   .0411951     1.21   0.227    -.0313087    .1310544
                     xbhatcu |          0  (omitted)
                       _cons |  -3.615348   1.222262    -2.96   0.003    -6.024008   -1.206687
                -------------+----------------------------------------------------------------
                     sigma_u |  .80376154
                     sigma_e |  .13011625
                         rho |  .97446276   (fraction of variance due to u_i)
                ------------------------------------------------------------------------------
                
                . test xbhatsq xbhatcu
                
                 ( 1)  xbhatsq = 0
                 ( 2)  o.xbhatcu = 0
                       Constraint 2 dropped
                
                       F(  1,   223) =    1.47
                            Prob > F =    0.2273
                I really have trouble interpreting it. I think it means I cannot reject the H0 that the model is correctly specified.

                Which in turn means that the functional form is adequate. Yet the problem with the low R-squared persists.
                According to another thread where you responded, this test is equivalent or similar to the RESET test, so should I be happy with my model and move on?
                How else can I explain this low R-squared? Is it simply that I omitted variables? That would be surprising, since I added all the controls suggested by the literature.
                Last edited by Luca Haseney; 11 Jan 2023, 10:51.


                • #9
                  the fitted/sq_fitted is a joint test, which is what RESET is doing.
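
                  Note that in the second block of #8, xbhatcu is generated as xbhat^2, so it duplicates xbhatsq and Stata omits it; the reported F(1, 223) therefore tests the squared term only. Below is a sketch of the same check with an actual cubic term. It assumes the variable names from #8; the names xb_lin, xb_sq and xb_cu are made up here so they do not clash with variables already in memory.

                  ```stata
                  * Baseline fixed-effects model, as in #8
                  xtreg Y1 x1 x2 x3 c1 c2 c3 c4 c5 c6 c7 c8 c9 i.Year, fe vce(cluster FIRM)

                  * Linear prediction and its square and cube
                  predict double xb_lin, xb
                  generate double xb_sq = xb_lin^2
                  generate double xb_cu = xb_lin^3

                  * Augmented model and joint test of the two added terms
                  xtreg Y1 x1 x2 x3 c1 c2 c3 c4 c5 c6 c7 c8 c9 i.Year xb_sq xb_cu, fe vce(cluster FIRM)
                  test xb_sq xb_cu
                  ```

                  If neither term is omitted, the joint test has two numerator degrees of freedom; a large p-value would again give no evidence against the functional form.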


                  • #10
                    Luca:
                    the r_sq you should check in an -fe- model is the within one, which in your case is not that low (0.4556).
                    Therefore, you should be happy with your model and move on.
                    Last edited by Carlo Lazzaro; 12 Jan 2023, 00:54.
                    Kind regards,
                    Carlo
                    (Stata 19.0)


                    • #11
                      Dear Carlo,

                      thank you very much!

                      I think I also read somewhere on the forum that the within R squared measure obtained by the reghdfe is misspecified anyways for fixed effects models.
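
                      One likely reason the two within R-squared figures in this thread differ so much (0.0566 from -reghdfe- in #1 versus 0.4556 from -xtreg, fe- in #8) is that they are net of different things: -reghdfe- reports it after partialling out every absorbed effect (here firm and year), whereas -xtreg, fe- removes only the firm effects and counts the i.Year dummies as regressors. A sketch to compare the two definitions, assuming the variable names from #1 and the stored result e(r2_within) from -reghdfe-:

                      ```stata
                      * Only FIRM absorbed, Year entered as dummies:
                      * within R2 is net of firm effects only (the -xtreg, fe- notion)
                      reghdfe Y1 x1 x2 x3 c1 c2 c3 c4 c5 c6 c7 c8 c9 i.Year, absorb(FIRM) cluster(FIRM)
                      display "Within R2, firm effects removed: " e(r2_within)

                      * FIRM and Year both absorbed:
                      * within R2 is net of firm and year effects
                      reghdfe Y1 x1 x2 x3 c1 c2 c3 c4 c5 c6 c7 c8 c9, absorb(FIRM Year) cluster(FIRM)
                      display "Within R2, firm and year effects removed: " e(r2_within)
                      ```

                      If the first figure is much larger than the second, the year dummies are doing most of the explaining, and the two numbers answer different questions rather than contradict each other.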
