  • R squared of fixed effects model too high

    Hello everybody,

    Thank you in advance for your time and effort! I really appreciate it!
    I have some big issues with my two-way fixed effects model that I can't solve, and to be honest I am getting very frustrated. I have an unbalanced panel data set with N=33 and T=7, and I am using the xtreg, fe command in Stata 16.

    My problem at this point is that the within R-squared is always greater than 90% and sometimes even greater than 95%. This seems far too good, and I cannot believe it is a sign of a good model; I would rather say it shows that my model is extremely bad.
    After looking for possible causes of such a measure, I concluded that it must be the result of a spurious regression.
    The causes of spurious regression, as far as I know, are: multicollinearity (but all VIFs are smaller than 5); overfitting (but when I discard multiple independent variables, dummies or control variables, the R-squared still stays around 90%); identical functional forms of the DV and IVs (not the case, since only two IVs are logarithmized); chance correlation (I really hope this isn't the case); and time trends or non-stationarity.

    With regard to time trends, I originally assumed that the time dummies account for those trends and that, especially in an N > T data set, I do not have to worry too much about them. What I have found on this topic so far is contradictory.

    My questions at this point are:
    1) Do you have other ideas besides my guess about spurious regression?
    2) Do you think it is necessary to check for stationarity and cointegration? If yes, would it be enough to take first differences of the dependent variable (if non-stationary) and of the non-stationary independent variables, or would I have to transform all variables in that case? (Basically: is it statistically valid to transform only some variables, e.g. the ones I do not want to interpret? See the sketch below for what I mean.)
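
    To make question 2 concrete, this is the kind of check and transformation I have in mind (only a sketch with my variable names; I have not settled on it, and with T=7 such a test has very little power):

    Code:
    * declare the panel structure (Entity and Year are my identifiers)
    xtset Entity Year

    * Fisher-type panel unit-root test (allows unbalanced panels)
    xtunitroot fisher lnGrossLoans, dfuller lags(1)

    * first differences via time-series operators (here only the DV)
    gen D_lnGrossLoans = D.lnGrossLoans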

    Code:
    xtreg lnGrossLoans lnTotalAssets LiquidityRatio EquityRatio lnNPLratio Depositratio ROAE2 NetFeesCommissionsNI BaselIIIDummy NPLDummy LDDummy MandatoryReservesDep RealGDPgrowthofChina ShanghaiCompositeStockMarket i.Year i.ReportingStand, fe cluster(Entity) vce(bootstrap, rep(50) seed(20))
    Greetings and thanks again!
    Joan

    P.S. I do not yet understand how to show my regression output, since screenshots are not allowed. Otherwise I would provide it instantly.
    Last edited by Joan Stein; 02 Aug 2020, 17:29.

  • #2
    "I do not yet understand how to show my regression output, since screenshots are not allowed. Otherwise I would provide it instantly."
    You present it by copying the command and its output from Stata's Results window and pasting that into your post, surrounding it with the same CODE delimiters you used to present your xtreg command in post #1.
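
    For example, the body of a post would contain something like this (the tags are typed literally around the pasted text; the command line is just an illustration):

    Code:
    [CODE]
    . xtreg lnGrossLoans lnTotalAssets i.Year, fe
      (output pasted exactly as it appears in the Results window)
    [/CODE]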



    • #3
      Good evening and thank you very much!

      I should probably add two things:
      1) My macroeconomic variables are highly collinear with the year dummies. I decided to include two macroeconomic variables and accept the omission of two year dummies. When I discard all macroeconomic control variables, all time dummies can be kept in the model, but the R^2 remains unchanged, so this does not seem to be the solution.
      2) There is a strong correlation between TotalAssets and GrossLoans.
      When I discard TotalAssets, my R^2 decreases to 83%, but that's it: the R^2 remains very high, and deleting further variables does not change the measure. So this does not seem to be the root of my problem.

      Code:
      xtreg lnGrossLoans lnTotalAssets LiquidityRatio EquityRatio lnNPLratio Depositratio ROAE2 i.Owner i.Ownership NetFeesCommissionsNI
      >  BaselIIIDummy NPLDummy LDDummy MandatoryReservesDep RealGDPgrowthofChina ShanghaiCompositeStockMarket i.Year i.ReportingStand, fe
      >  noomitted vce(bootstrap, rep(50) seed(20))
      note: 2018.Year omitted because of collinearity
      note: 2019.Year omitted because of collinearity
      (running xtreg on estimation sample)
      
      Bootstrap replications (50)
      ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
      ..................................................    50
      
      Fixed-effects (within) regression               Number of obs     =        219
      Group variable: Entity                          Number of groups  =         33
      
      R-sq:                                           Obs per group:
           within  = 0.9421                                         min =          3
           between = 0.9954                                         avg =        6.6
           overall = 0.9938                                         max =          7
      
                                                      Wald chi2(17)     =    1564.18
      corr(u_i, Xb)  = -0.1666                        Prob > chi2       =     0.0000
      
                                                       (Replications based on 33 clusters in Entity)
      ----------------------------------------------------------------------------------------------
                                   |   Observed   Bootstrap                         Normal-based
                      lnGrossLoans |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
      -----------------------------+----------------------------------------------------------------
                     lnTotalAssets |   1.038927   .1158523     8.97   0.000     .8118606    1.265993
                    LiquidityRatio |  -.2472995   .2416455    -1.02   0.306     -.720916    .2263169
                       EquityRatio |   2.416022   1.692168     1.43   0.153    -.9005661     5.73261
                        lnNPLratio |   .1660521   .0545376     3.04   0.002     .0591605    .2729438
                      Depositratio |   .5765198   .2649262     2.18   0.030     .0572741    1.095766
                             ROAE2 |  -.0002293   .0001782    -1.29   0.198    -.0005785    .0001199
                                   |
              NetFeesCommissionsNI |    .039496    .023858     1.66   0.098    -.0072647    .0862568
                     BaselIIIDummy |  -.0240104   .0182694    -1.31   0.189    -.0598178     .011797
                          NPLDummy |  -.0656721   .0247447    -2.65   0.008    -.1141709   -.0171733
                           LDDummy |   .0254201   .0212323     1.20   0.231    -.0161944    .0670347
              MandatoryReservesDep |   4.570457   1.023753     4.46   0.000     2.563938    6.576975
              RealGDPgrowthofChina |  -.1114391   .0571882    -1.95   0.051     -.223526    .0006477
      ShanghaiCompositeStockMarket |   .0004287   .0005781     0.74   0.458    -.0007043    .0015617
                                   |
                              Year |
                             2014  |  -.1508985   .0313104    -4.82   0.000    -.2122657   -.0895313
                             2015  |  -.1737876   .0227415    -7.64   0.000    -.2183601   -.1292151
                             2016  |  -.1895637   .0294689    -6.43   0.000    -.2473216   -.1318058
                             2017  |   -.123125    .029259    -4.21   0.000    -.1804717   -.0657784
                                   |
                             _cons |  -1.497066   1.984147    -0.75   0.451    -5.385923     2.39179
      -----------------------------+----------------------------------------------------------------
                           sigma_u |  .10665451
                           sigma_e |  .07111294
                               rho |  .69224815   (fraction of variance due to u_i)
      ----------------------------------------------------------------------------------------------
      Code:
      . collin GL TA LR ROAE2 ER lnNPLratio DR MR GDP Infl NFCINI Int Stock BIII NPL LDD
      (obs=219)
      
        Collinearity Diagnostics
      
                              SQRT                   R-
        Variable      VIF     VIF    Tolerance    Squared
      ----------------------------------------------------
              GL    249.36   15.79    0.0040      0.9960
              TA    221.79   14.89    0.0045      0.9955
              LR      2.11    1.45    0.4740      0.5260
           ROAE2      3.06    1.75    0.3265      0.6735
              ER      2.81    1.68    0.3555      0.6445
      lnNPLratio      5.04    2.25    0.1984      0.8016
              DR      1.32    1.15    0.7556      0.2444
              MR      3.12    1.77    0.3203      0.6797
             GDP      7.18    2.68    0.1394      0.8606
            Infl      1.58    1.26    0.6321      0.3679
          NFCINI      1.63    1.28    0.6136      0.3864
             Int      3.77    1.94    0.2653      0.7347
           Stock      2.24    1.50    0.4459      0.5541
            BIII      1.44    1.20    0.6926      0.3074
             NPL      1.81    1.35    0.5520      0.4480
             LDD      1.33    1.15    0.7534      0.2466
      ----------------------------------------------------
      Last edited by Joan Stein; 02 Aug 2020, 18:37.



      • #4
        Hi Joan,

        With whatever little knowledge I have of FE estimation, I have the following points to make.

        1. With T=7, I do not think there is any need to check for stationarity.
        2. If cross-section-invariant variables (such as the macroeconomic variables in your case) are of interest to you and you need their coefficients, do not include time dummies. On the other hand, if such variables are not of interest to you, you should not include any of them in your model; rather, you should just use time dummies.
        3. With regard to this high R-squared, I suggest you have a closer look at your data set. Since you have only around 200 firm-year observations, please scan your data manually for potential errors such as duplicated values.
        4. Try excluding one independent variable at a time and re-running the model; maybe you will get some clue about the cause of this high R-squared (see the sketch after this list).
        5. Try running a pooled OLS model, a Fama-MacBeth model and a between-effects model and see whether this high R-squared persists (also sketched below).
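
        A rough sketch of points 4 and 5 (only part of your regressor list is used for brevity; extend the local with your full set, and note that -xtfmb- is a user-written command):

        Code:
        * point 4: drop one regressor at a time and watch the within R-squared
        local xvars lnTotalAssets LiquidityRatio EquityRatio lnNPLratio Depositratio ROAE2
        foreach v of local xvars {
            local rest : list xvars - v
            quietly xtreg lnGrossLoans `rest' i.Year, fe
            display "without `v': within R2 = " %6.4f e(r2_w)
        }

        * point 5: alternative estimators
        regress lnGrossLoans `xvars' i.Year, vce(cluster Entity)   // pooled OLS
        xtreg lnGrossLoans `xvars', be                             // between effects
        * Fama-MacBeth: e.g. the user-written -xtfmb- (ssc install xtfmb)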

        Having said that, do look at the literature on your topic. Possibly such high R-squared values are common!



        • #5
          Joan:
          it may well be that -GL- and -TA- cannot live together in the same model.
          Two asides follow:
          - it's the within R-sq that you should look at after -xtreg, fe-;
          - I fail to see why you -bootstrap- the standard errors instead of clustering them on -panelid-.
          Kind regards,
          Carlo
          (StataNow 18.5)



          • #6
            Good morning to both of you! Thank you very much for your comments!

            @Prateek
            1) I also thought that this might not be necessary, which is good to hear, because such transformations would change just about everything...
            2) Since a macroeconomic variable is a bit different from a year dummy and covers the whole examination period, I decided to take a kind of middle way and include two of them. Given the strong collinearity that triggers the omission, I expect the macro variables to control for much more than their nominal purpose and still control sufficiently for the year effects. To be honest, it is probably more like window dressing: a regression without macroeconomic control variables would look wrong to me, and probably to others as well.
            3/5) I will do this today and hopefully see better results, thank you!
            4) Unfortunately, I have already tried this, and the only variable with an overwhelming impact that I found was TA.

            @Carlo
            1) I was already afraid that this correlation might be a problem. Since TA is an important (though not the most important) variable, this hurts a bit, and I will look for substitutes. A transformation of TA would harm the interpretation of the variable, so that is unfortunately not an option.
            2) Unfortunately, both the within and the overall R-squared are inflated. I also looked at the adjusted within R^2 via the ereturn list (sketched below), but it is around 93%, so there is no improvement there either.
            3) Because of the presence of heteroskedasticity and autocorrelation, I clustered the standard errors on -panelid-. Since I have a very small unbalanced data set with only a few time periods (clustering seems to require that the number of clusters grows large), I feared that my standard errors might be biased downward, so I used bootstrapping to adjust for this small-sample bias.
            -> Oh, I now see that I accidentally deleted the cluster(Entity) part from the code above the regression output. Of course you are right: I cluster my SEs on -panelid- and use the bootstrapping only as an additional robustness provision. The code in my original post is complete and, as you can see, there I clustered my SEs on -panelid-.
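
            For reference, this is how I looked at the stored results (a sketch with a shortened regressor list; e(r2_w) is the within R-squared that xtreg reports):

            Code:
            quietly xtreg lnGrossLoans lnTotalAssets LiquidityRatio i.Year, fe vce(cluster Entity)
            ereturn list                           // all stored scalars and macros
            display "within R2 = " %6.4f e(r2_w)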
            Last edited by Joan Stein; 03 Aug 2020, 02:03.



            • #7
              Joan:
              thanks for clarifying.
              1) and 2) I see the issue with your model. However, trying to sell such sky-rocketing R-squareds can be problematic (because most readers/reviewers would find them inflated).
              3) Usually, -bootstrap- is an option when clusters are too few (which does not seem to be your case) or when clustering is not supported by a given Stata regression command (which, again, does not seem to be your case). In the presence of heteroskedasticity and/or autocorrelation, I would cluster my SEs on -panelid- and see whether they differ substantively from their -bootstrap- counterparts; a sketch follows.
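
              Something along these lines (regressor list shortened for brevity):

              Code:
              xtreg lnGrossLoans lnTotalAssets LiquidityRatio i.Year, fe vce(cluster Entity)
              estimates store cl
              xtreg lnGrossLoans lnTotalAssets LiquidityRatio i.Year, fe vce(bootstrap, reps(50) seed(20))
              estimates store bs
              estimates table cl bs, b(%9.4f) se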
              Kind regards,
              Carlo
              (StataNow 18.5)



              • #8
                @Carlo
                Thank you for investing your time in a problem that is solely mine!

                Yes, that is exactly my problem. Even though it might not actually be a spurious regression, it will look like one to an evaluating reader.
                My emergency solution might be to drop TA to push the R^2 below 90%, but in my opinion an R^2 of 85% is still far too high. Something between 50% and 60% would seem reasonable to me.

                Oh, so if I understand you correctly, I should use the bootstrapped SEs as a robustness check but then fall back on my clustered SEs?



                • #9
                  Dear Professors,

                  I am a PhD student, I am in a difficult position, and I am facing an uphill struggle.

                  I have a multicollinearity issue involving my main independent variable (a sustainability-committee dummy). Dropping it would harm my research, so I am seeking advice on an alternative solution. I computed VIFs, and the VIF is greater than 10. I then applied a log transformation and re-ran the regression, but this made the multicollinearity worse: because the sustainability committee is a dummy variable (1 or 0), taking the log of 1 gives zero, so the sustainability-committee column contains no value greater than zero. I also tested for heteroskedasticity in OLS with the Breusch-Pagan / Cook-Weisberg test, and the p-value is still highly significant.
                  Attached Files



                  • #10
                    Joan: Since the within R-squared is so high, I suspect the time dummies are explaining much of the variation in the dependent variable. And I agree with Prateek that it is best to leave out the macro variables, as putting in year dummies is completely flexible and less arbitrary.

                    Many years ago I published a paper in Economics Letters on computing R-squareds by removing trends. It never caught on, but I think it could help you. I would detrend your y variable and then use the detrended version. Hopefully the within R-squared using the detrended variable will be reasonable. The idea is to count only the variation in y that remains after you've netted out both sets of fixed effects. I might even remove the cross-sectional fixed effects, too.

                    Generic code:

                    Code:
                    * net out both sets of fixed effects from y (not the x's)
                    reg y i.ID i.year
                    predict y_dt, resid
                    * re-estimate using the detrended dependent variable
                    xtreg y_dt x1 x2 ... i.year, fe vce(cluster ID)
                    You have a pretty small N -- kind of on the border for clustering -- but it's not clear bootstrapping is better.

                    JW



                    • #11
                      Dear Professor Wooldridge,

                      thank you very much for your comments and your help!
                      I immediately followed your instructions, dropped the macroeconomic variables, and removed the trend in the dependent variable.
                      Now my within R^2 is indeed only 17.7%, which seems a much more realistic magnitude.
                      Although it was to be expected, almost all of my explanatory variables became insignificant, so I will have to take a closer look at the new regression.

                      I will also reconsider the bootstrapping, and at the very least expand my argumentation for using it.
                      Thanks again to everyone.




                      @Aso
                      I am not sure whether a comment from me will help with your problem, but I will just say what I thought when I looked at your Word document.
                      What strikes me about your regression, besides the inflated VIF measures, is the scaling difference between your main IV and all the other variables in the model: the coefficient of the sustainability committee seems to be 1000 times larger than all the other coefficients, even though the multicollinearity problem covers many more variables in the model.
                      And I agree with your point that taking the logarithm of a dummy variable is not a possible solution.
                      A possible solution might be to look at a correlation matrix of your independent variables and transform those that are highly correlated with your dummy, or to replace them with other proxies that can serve as equivalents. A sketch follows.
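
                      For instance, something like this (the variable names are only placeholders for those in your model):

                      Code:
                      * correlation matrix of the independent variables, with significance stars
                      pwcorr SustCommittee BoardSize FirmSize Leverage, star(0.05)

                      * VIFs after a pooled OLS regression
                      regress ESGScore SustCommittee BoardSize FirmSize Leverage
                      estat vif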
                      Last edited by Joan Stein; 04 Aug 2020, 04:02.

