Sample split or interactions in fixed effects regression

Wouter Wakker

Join Date: Nov 2018
Posts: 621

10 Apr 2019, 06:53

Dear Stata community,

I'm analyzing whether capital controls affect countries' economic development differently at different income levels. To classify countries I use the World Bank categorization (low, lower middle, upper middle, high income). I want to capture both year-specific and country-specific effects, so I use fixed effects.

The data includes 100 countries and 22 years and looks like this:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long country2 int year double ka float loggdppc long incgroup_th
1 1995   1 9.137277 2
1 1996 .75 9.160067 2
1 1997   1 9.154971 2
1 1998 .75 9.189805 2
1 1999 .75 9.207232 2
end
label values country2 country2
label def country2 1 "Algeria", modify
label values incgroup_th incomegroup5
label def incomegroup5 2 "Lower Middle Income", modify

I thought of running the regression in two ways: first, by including interaction effects like this:

Code:

xtreg loggdppc c.L.ka##I.incgroup_th I.year, fe

Second, by splitting the sample like this:

Code:

xtreg loggdppc L.ka I.year if incgroup_th==1, fe
xtreg loggdppc L.ka I.year if incgroup_th==2, fe
xtreg loggdppc L.ka I.year if incgroup_th==3, fe
xtreg loggdppc L.ka I.year if incgroup_th==4, fe

However, when I run these regressions I get very different results, which I don't understand. Here are the results of the first regression:

Code:

. xtreg loggdppc c.L.ka##I.incgroup_th I.year, fe

Fixed-effects (within) regression               Number of obs     =      2,094
Group variable: country2                        Number of groups  =        100

R-sq:                                           Obs per group:
     within  = 0.6439                                         min =         17
     between = 0.9173                                         avg =       20.9
     overall = 0.5815                                         max =         21

                                                F(27,1967)        =     131.72
corr(u_i, Xb)  = 0.6650                         Prob > F          =     0.0000

--------------------------------------------------------------------------------------
            loggdppc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------------------+----------------------------------------------------------------
                  ka |
                 L1. |  -.0403811   .0397236    -1.02   0.309     -.118286    .0375237
                     |
         incgroup_th |
Lower Middle Income  |   .1457912   .0257779     5.66   0.000     .0952363     .196346
Upper Middle Income  |   .3751304   .0319848    11.73   0.000     .3124027    .4378582
        High Income  |   .4449389   .0383314    11.61   0.000     .3697644    .5201134
                     |
   incgroup_th#cL.ka |
Lower Middle Income  |   .0749596   .0360304     2.08   0.038     .0042979    .1456214
Upper Middle Income  |  -.0360145   .0431999    -0.83   0.405    -.1207369    .0487079
        High Income  |  -.0693851   .0488317    -1.42   0.156    -.1651525    .0263823
                     |
                year |
               1997  |    .018742   .0167662     1.12   0.264    -.0141394    .0516234
               1998  |    .029583   .0167891     1.76   0.078    -.0033433    .0625093
               1999  |   .0418188    .016792     2.49   0.013     .0088868    .0747507
               2000  |   .0743435   .0167547     4.44   0.000     .0414846    .1072024
               2001  |   .0902911    .016751     5.39   0.000     .0574395    .1231427
               2002  |   .1016007   .0167662     6.06   0.000     .0687192    .1344821
               2003  |   .1296795   .0167714     7.73   0.000      .096788     .162571
               2004  |   .1620865   .0168044     9.65   0.000     .1291302    .1950429
               2005  |   .1879397   .0168483    11.15   0.000     .1548972    .2209821
               2006  |   .2263145   .0168752    13.41   0.000     .1932195    .2594096
               2007  |   .2638747   .0169032    15.61   0.000     .2307246    .2970249
               2008  |    .264955   .0170949    15.50   0.000     .2314291     .298481
               2009  |   .2463653   .0170868    14.42   0.000      .212855    .2798755
               2010  |   .2712964   .0171339    15.83   0.000     .2376938    .3048989
               2011  |   .2974869   .0171138    17.38   0.000     .2639237    .3310501
               2012  |   .3036946   .0172869    17.57   0.000      .269792    .3375973
               2013  |   .3215378   .0173358    18.55   0.000     .2875393    .3555364
               2014  |   .3400188   .0173641    19.58   0.000     .3059649    .3740727
               2015  |   .3602189   .0173665    20.74   0.000     .3261602    .3942775
               2016  |   .3808082    .017317    21.99   0.000     .3468465    .4147699
                     |
               _cons |   8.994969    .030888   291.21   0.000     8.934392    9.055545
---------------------+----------------------------------------------------------------
             sigma_u |  .96225507
             sigma_e |  .11780779
                 rho |  .98523252   (fraction of variance due to u_i)
--------------------------------------------------------------------------------------
F test that all u_i=0: F(99, 1967) = 209.90                  Prob > F = 0.0000

Here are the results of the regression for the low income sample, which is the base group of the previous regression.

Code:

. xtreg loggdppc L.ka I.year if incgroup_th==1, fe

Fixed-effects (within) regression               Number of obs     =        301
Group variable: country2                        Number of groups  =         25

R-sq:                                           Obs per group:
     within  = 0.7095                                         min =          1
     between = 0.4164                                         avg =       12.0
     overall = 0.0050                                         max =         21

                                                F(21,255)         =      29.66
corr(u_i, Xb)  = -0.4428                        Prob > F          =     0.0000

------------------------------------------------------------------------------
    loggdppc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          ka |
         L1. |   .2355757   .0572047     4.12   0.000     .1229219    .3482295
             |
        year |
       1997  |   .0419334   .0341734     1.23   0.221    -.0253646    .1092314
       1998  |   .0661726   .0350231     1.89   0.060    -.0027988     .135144
       1999  |   .0856925    .035103     2.44   0.015     .0165638    .1548213
       2000  |    .099322   .0341923     2.90   0.004     .0319868    .1666573
       2001  |   .1287047   .0341876     3.76   0.000     .0613788    .1960306
       2002  |    .150422   .0351686     4.28   0.000      .081164    .2196799
       2003  |   .1780879   .0350638     5.08   0.000     .1090364    .2471394
       2004  |    .217291   .0360973     6.02   0.000     .1462043    .2883778
       2005  |   .2722467    .037468     7.27   0.000     .1984605    .3460329
       2006  |   .3210197   .0375514     8.55   0.000     .2470694      .39497
       2007  |   .3702423   .0375416     9.86   0.000     .2963114    .4441733
       2008  |   .4095371   .0453309     9.03   0.000     .3202665    .4988078
       2009  |   .4345497   .0451828     9.62   0.000     .3455708    .5235286
       2010  |   .4641847   .0450854    10.30   0.000     .3753976    .5529718
       2011  |   .5113764   .0450854    11.34   0.000     .4225894    .6001635
       2012  |   .5211868   .0526365     9.90   0.000     .4175292    .6248443
       2013  |   .5580252   .0526365    10.60   0.000     .4543677    .6616828
       2014  |   .5769499   .0566277    10.19   0.000     .4654323    .6884675
       2015  |   .6101766   .0566277    10.78   0.000      .498659    .7216942
       2016  |   .6404074   .0566277    11.31   0.000     .5288898     .751925
             |
       _cons |   7.264662   .0459164   158.22   0.000     7.174238    7.355086
-------------+----------------------------------------------------------------
     sigma_u |  .53813001
     sigma_e |  .11135118
         rho |  .95894112   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(24, 255) = 209.43                   Prob > F = 0.0000

I thought the constant and the coefficient on L.ka in the second regression should match those for the base level in the first regression. However, as you can see, they are completely different. The same is true for the other income groups, which I won't post here to save space. I'm now not sure which results I should trust. I would be very grateful if someone could tell me what I'm missing.

Wouter Wakker

Join Date: Nov 2018

Posts: 621
#2

21 Apr 2019, 08:50

I played around with the data a little more, and I think the difference arises because the demeaning is done differently when countries can change income groups. In the first model, demeaning is done per country over the whole period, while in the second model it is done only over a subset of each country's observations (the years in which it was a low income country). Since I want to control for country fixed effects, I think it makes more sense to demean over the whole period and use interactions. However, I'm not entirely sure about this, so any help is highly appreciated.
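
One way to check how much group switching there is could look like the sketch below. It uses a made-up toy panel, and the names id, grp, switcher, one_per_id, n_all and n_grp1 are hypothetical stand-ins, not variables from the data above. It flags countries whose group is not constant and shows how many years a split-sample regression would actually use for each country.

```stata
* Toy panel: country 1 moves from group 1 to group 2 in 2002, country 2 never moves
clear
input id year grp
1 2000 1
1 2001 1
1 2002 2
2 2000 1
2 2001 1
2 2002 1
end

* Flag countries whose group is not constant over the whole period
bysort id (grp): gen byte switcher = grp[1] != grp[_N]
egen byte one_per_id = tag(id)
count if one_per_id & switcher
scalar n_switchers = r(N)

* A split-sample -xtreg, fe- on grp == 1 would demean country 1 over 2000-2001 only
bysort id: egen n_all = count(year)
bysort id: egen n_grp1 = total(grp == 1)
list id year grp switcher n_all n_grp1, sepby(id)
```

Wherever n_grp1 is smaller than n_all, the country mean that is subtracted in the split-sample regression differs from the one used in the pooled regression.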
Clyde Schechter

Join Date: Apr 2014

Posts: 29948
#3

21 Apr 2019, 16:44

Well, what you explain in #2 may account for some of the difference.

But even if income groups were unchanging, with fixed effects regressions you cannot expect the results of stratified regressions to match those of a combined regression with interaction. The reason is actually fairly simple. In general, when you add or remove covariates from any kind of regression model, the results can change. They can change drastically: large differences in magnitude, opposite signs, whatever. In fact, you should, in general, expect that they will change when you do that. The results of -reg y x1 x2- and the results of -reg y x1 x3- will, in general, not resemble each other. Now, with -xtreg, fe-, even though the fixed effects are implemented in Stata by de-meaning, the process is completely equivalent to introducing a bunch of covariates: an indicator ("dummy") variable for each country (save one reference category). So if you do -xtreg, fe- on subsets of the full data set that contain different countries, you are changing the covariates in the model! And the results do not need to agree.
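
The equivalence between demeaning and explicit indicators can be checked on simulated data. The following is only a sketch: the panel is made up, and id, year, x, y and u are hypothetical names chosen for illustration.

```stata
* Simulated panel: 20 countries, 10 years each (all names and numbers are made up)
clear
set seed 12345
set obs 20
gen id = _n
gen u = rnormal()              // country-level effect
expand 10                      // 10 years per country
bysort id: gen year = 2000 + _n
gen x = rnormal() + 0.5*u      // regressor correlated with the country effect
gen y = 1 + 2*x + u + rnormal()
xtset id year

* Fixed effects by demeaning
xtreg y x i.year, fe
scalar b_fe = _b[x]

* Same model with an explicit indicator for each country
regress y x i.year i.id
scalar b_lsdv = _b[x]

display "xtreg, fe: " b_fe "   regress with indicators: " b_lsdv
```

The two slopes on x should coincide up to numerical precision, which is why restricting the sample to different countries amounts to changing the set of covariates.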

Even outside the context of fixed effects regression, you might have a simple regression model such as -regress y x- and you are interested in knowing whether the effect of x on y changes according to, say, sex of the person. If you do -regress y sex##x- and compare the results to -regress y x if sex == "male"- and -regress y x if sex == "female"- they will be consistent (if you know how to do it correctly--this is easy to mess up.) But if there is another variable in the model, say -regress y x w-, then in general the results of -regress y sex##x w- do not have to agree with separately running -regress y x w if sex == "male"- and -regress y x w if sex == "female"-. The reason here is that the interaction model implicitly constrains the effect of w to be the same in both sexes, whereas in the separate models there is no such constraint, and the absence of such constraint makes it possible for w itself to change the estimates of x differently. In the case of a simple regression, you can get around this problem by running a more complicated interaction model, one in which sex is interacted with all of the predictors. -regress y sex##(x w)-. The results of this will agree with the results of separate male and female regressions of y on x and w.
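
As a rough illustration of this point, here is a sketch using the auto dataset that ships with Stata, with foreign standing in for sex, mpg for x and weight for w. That mapping is an assumption made purely for the example.

```stata
sysuse auto, clear

* Limited interaction: the effect of weight is constrained to be equal in both groups
regress price i.foreign##c.mpg weight
scalar b_limited = _b[mpg]

* Full interaction: every predictor is interacted with the group indicator
regress price i.foreign##(c.mpg c.weight)
scalar b_full = _b[mpg]                        // mpg slope in the base group (foreign == 0)
scalar b_full_f = _b[mpg] + _b[1.foreign#c.mpg] // mpg slope when foreign == 1

* Separate regressions by group
regress price mpg weight if foreign == 0
scalar b_split = _b[mpg]
regress price mpg weight if foreign == 1
scalar b_split_f = _b[mpg]

display "limited: " b_limited "   full: " b_full "   split: " b_split
```

Only the fully interacted model is expected to reproduce the split-sample slopes. The limited-interaction slope will generally differ.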

Now, when working with -xtreg, fe-, you cannot do that same trick, because there is no syntax in -xtreg, fe- that allows you to tell Stata to interact incgroup_th with all of the country fixed effects. (This is because, in Stata's implementation of -xtreg, fe- there aren't actually any country indicators: it's done by demeaning instead, and there is no syntax to "interact" incgroup_th with demeaning.) In this case, you would have to do it by emulating -xtreg, fe- using -regress- with country indicators as covariates. So something like -regress loggdppc (c.L.ka i.year i.country2)##i.incgroup_th- would do that. (Of course, if the number of values of country2 is really large, you won't have enough matrix space to run this, but for just the 100 you have it shouldn't be a problem.) And these results would agree with the results of separate regressions.
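
A minimal sketch of this emulation on simulated data follows. It assumes group membership is fixed within each country, and id, grp, year, x, y and u are hypothetical names, not the variables in the data above.

```stata
* Simulated panel: 20 countries, 10 years, 2 time-invariant groups (all made up)
clear
set seed 2019
set obs 20
gen id = _n
gen grp = 1 + (id > 10)            // group membership fixed within country
gen u = rnormal()                  // country-level effect
expand 10
bysort id: gen year = 2000 + _n
gen x = rnormal() + 0.5*u
* Slope of x and the time trend both differ by group
gen y = 1 + cond(grp == 1, 2, -1)*x + 0.1*grp*(year - 2000) + u + rnormal()
xtset id year

* Split sample: fixed effects regression within group 1 only
xtreg y x i.year if grp == 1, fe
scalar b_split = _b[x]

* Pooled: every regressor, including the country indicators, interacted with grp
regress y (c.x i.year i.id)##i.grp
scalar b_pooled = _b[x]            // slope of x in the base group (grp == 1)

display "split: " b_split "   pooled, fully interacted: " b_pooled
```

Stata will omit the many empty or collinear country-by-group cells, but the slope on x for the base group should still match the split-sample estimate.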

But you have a problem: what you have found is that when you just run an overall model with interaction of incgroup_th with just c.l.ka you get results that are strikingly different from those in the separate regressions. Opposite signs, and the confidence intervals don't even overlap at all. The implication in light of the above is that there are important interactions of incgroup_th with year or with country fixed effects (or both). If there weren't, the results would more or less agree. So these analyses are telling you that a model based only on the limited interaction is mis-specified. Consequently, you should either use separate regressions, or if you combine them into an interaction model, the interaction must include all the model variables and all the countries as well, not just c.l.ka.
Wouter Wakker

Join Date: Nov 2018

Posts: 621
#4

24 Apr 2019, 14:31

Thank you Clyde Schechter for your extremely clear and helpful explanation. I didn't understand that you have to interact the country and year fixed effects with the group indicator to get the same results as splitting the sample, but indeed,

Code:

xtreg loggdppc (c.L.ka i.year)##incgroup_th i.country2#incgroup_th, fe

yields exactly the same coefficients as

Code:

xtreg loggdppc L.ka i.year if incgroup_th==1, fe
xtreg loggdppc L.ka i.year if incgroup_th==2, fe
xtreg loggdppc L.ka i.year if incgroup_th==3, fe
xtreg loggdppc L.ka i.year if incgroup_th==4, fe

where the first model has the advantage of having smaller standard errors.

Thank you again!

Last edited by Wouter Wakker; 24 Apr 2019, 15:28.