  • Sample split or interactions in fixed effects regression

    Dear Stata community,

    I'm currently analyzing whether capital controls affect economic development of countries in different ways for different levels of income. To differentiate countries I use the World Bank categorization (low, lower middle, upper middle, high income). I want to capture year specific effects and country effects so I use fixed effects.

    The data includes 100 countries and 22 years and looks like this:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long country2 int year double ka float loggdppc long incgroup_th
    1 1995   1 9.137277 2
    1 1996 .75 9.160067 2
    1 1997   1 9.154971 2
    1 1998 .75 9.189805 2
    1 1999 .75 9.207232 2
    end
    label values country2 country2
    label def country2 1 "Algeria", modify
    label values incgroup_th incomegroup5
    label def incomegroup5 2 "Lower Middle Income", modify
    I thought of running the regression in two ways: first, by including interaction effects like this:
    Code:
    xtreg loggdppc c.L.ka##I.incgroup_th I.year, fe
    Second, by splitting the sample like this:
    Code:
    xtreg loggdppc L.ka I.year if incgroup_th==1, fe
    xtreg loggdppc L.ka I.year if incgroup_th==2, fe
    xtreg loggdppc L.ka I.year if incgroup_th==3, fe
    xtreg loggdppc L.ka I.year if incgroup_th==4, fe
    However, when I run these regressions I get very different results, which I don't understand. Here are the results of the first regression:
    Code:
    . xtreg loggdppc c.L.ka##I.incgroup_th I.year, fe
    
    Fixed-effects (within) regression               Number of obs     =      2,094
    Group variable: country2                        Number of groups  =        100
    
    R-sq:                                           Obs per group:
         within  = 0.6439                                         min =         17
         between = 0.9173                                         avg =       20.9
         overall = 0.5815                                         max =         21
    
                                                    F(27,1967)        =     131.72
    corr(u_i, Xb)  = 0.6650                         Prob > F          =     0.0000
    
    --------------------------------------------------------------------------------------
                loggdppc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    ---------------------+----------------------------------------------------------------
                      ka |
                     L1. |  -.0403811   .0397236    -1.02   0.309     -.118286    .0375237
                         |
             incgroup_th |
    Lower Middle Income  |   .1457912   .0257779     5.66   0.000     .0952363     .196346
    Upper Middle Income  |   .3751304   .0319848    11.73   0.000     .3124027    .4378582
            High Income  |   .4449389   .0383314    11.61   0.000     .3697644    .5201134
                         |
       incgroup_th#cL.ka |
    Lower Middle Income  |   .0749596   .0360304     2.08   0.038     .0042979    .1456214
    Upper Middle Income  |  -.0360145   .0431999    -0.83   0.405    -.1207369    .0487079
            High Income  |  -.0693851   .0488317    -1.42   0.156    -.1651525    .0263823
                         |
                    year |
                   1997  |    .018742   .0167662     1.12   0.264    -.0141394    .0516234
                   1998  |    .029583   .0167891     1.76   0.078    -.0033433    .0625093
                   1999  |   .0418188    .016792     2.49   0.013     .0088868    .0747507
                   2000  |   .0743435   .0167547     4.44   0.000     .0414846    .1072024
                   2001  |   .0902911    .016751     5.39   0.000     .0574395    .1231427
                   2002  |   .1016007   .0167662     6.06   0.000     .0687192    .1344821
                   2003  |   .1296795   .0167714     7.73   0.000      .096788     .162571
                   2004  |   .1620865   .0168044     9.65   0.000     .1291302    .1950429
                   2005  |   .1879397   .0168483    11.15   0.000     .1548972    .2209821
                   2006  |   .2263145   .0168752    13.41   0.000     .1932195    .2594096
                   2007  |   .2638747   .0169032    15.61   0.000     .2307246    .2970249
                   2008  |    .264955   .0170949    15.50   0.000     .2314291     .298481
                   2009  |   .2463653   .0170868    14.42   0.000      .212855    .2798755
                   2010  |   .2712964   .0171339    15.83   0.000     .2376938    .3048989
                   2011  |   .2974869   .0171138    17.38   0.000     .2639237    .3310501
                   2012  |   .3036946   .0172869    17.57   0.000      .269792    .3375973
                   2013  |   .3215378   .0173358    18.55   0.000     .2875393    .3555364
                   2014  |   .3400188   .0173641    19.58   0.000     .3059649    .3740727
                   2015  |   .3602189   .0173665    20.74   0.000     .3261602    .3942775
                   2016  |   .3808082    .017317    21.99   0.000     .3468465    .4147699
                         |
                   _cons |   8.994969    .030888   291.21   0.000     8.934392    9.055545
    ---------------------+----------------------------------------------------------------
                 sigma_u |  .96225507
                 sigma_e |  .11780779
                     rho |  .98523252   (fraction of variance due to u_i)
    --------------------------------------------------------------------------------------
    F test that all u_i=0: F(99, 1967) = 209.90                  Prob > F = 0.0000
    Here are the results of the regression for the low income sample, which is the base group of the previous regression.
    Code:
    . xtreg loggdppc L.ka I.year if incgroup_th==1, fe
    
    Fixed-effects (within) regression               Number of obs     =        301
    Group variable: country2                        Number of groups  =         25
    
    R-sq:                                           Obs per group:
         within  = 0.7095                                         min =          1
         between = 0.4164                                         avg =       12.0
         overall = 0.0050                                         max =         21
    
                                                    F(21,255)         =      29.66
    corr(u_i, Xb)  = -0.4428                        Prob > F          =     0.0000
    
    ------------------------------------------------------------------------------
        loggdppc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
              ka |
             L1. |   .2355757   .0572047     4.12   0.000     .1229219    .3482295
                 |
            year |
           1997  |   .0419334   .0341734     1.23   0.221    -.0253646    .1092314
           1998  |   .0661726   .0350231     1.89   0.060    -.0027988     .135144
           1999  |   .0856925    .035103     2.44   0.015     .0165638    .1548213
           2000  |    .099322   .0341923     2.90   0.004     .0319868    .1666573
           2001  |   .1287047   .0341876     3.76   0.000     .0613788    .1960306
           2002  |    .150422   .0351686     4.28   0.000      .081164    .2196799
           2003  |   .1780879   .0350638     5.08   0.000     .1090364    .2471394
           2004  |    .217291   .0360973     6.02   0.000     .1462043    .2883778
           2005  |   .2722467    .037468     7.27   0.000     .1984605    .3460329
           2006  |   .3210197   .0375514     8.55   0.000     .2470694      .39497
           2007  |   .3702423   .0375416     9.86   0.000     .2963114    .4441733
           2008  |   .4095371   .0453309     9.03   0.000     .3202665    .4988078
           2009  |   .4345497   .0451828     9.62   0.000     .3455708    .5235286
           2010  |   .4641847   .0450854    10.30   0.000     .3753976    .5529718
           2011  |   .5113764   .0450854    11.34   0.000     .4225894    .6001635
           2012  |   .5211868   .0526365     9.90   0.000     .4175292    .6248443
           2013  |   .5580252   .0526365    10.60   0.000     .4543677    .6616828
           2014  |   .5769499   .0566277    10.19   0.000     .4654323    .6884675
           2015  |   .6101766   .0566277    10.78   0.000      .498659    .7216942
           2016  |   .6404074   .0566277    11.31   0.000     .5288898     .751925
                 |
           _cons |   7.264662   .0459164   158.22   0.000     7.174238    7.355086
    -------------+----------------------------------------------------------------
         sigma_u |  .53813001
         sigma_e |  .11135118
             rho |  .95894112   (fraction of variance due to u_i)
    ------------------------------------------------------------------------------
    F test that all u_i=0: F(24, 255) = 209.43                   Prob > F = 0.0000
    I thought the constant and the coefficient should be the same for the second regression and the base level of the first regression. However, as you can see they are completely different. This is the same for the other income groups, which I won't post here due to space. I'm not sure now which results I should trust. I would be very grateful if someone could tell me what I'm missing.

  • #2
    I played around a little more with the data, and I think the difference arises because the demeaning is done differently when countries can change income groups. In the first model, demeaning is done per country over the whole period, while in the second model it is done only over the subset of years in which each country was in the given income group (here, low income). Since I want to control for country fixed effects, I think it makes more sense to demean over the whole period and use interactions. However, I'm not entirely sure about this; any help is highly appreciated.
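
    A quick way to check whether countries really do switch groups, as a sketch: country2 and incgroup_th are the variables from my -dataex- example, while switches and first are hypothetical helper names made up for this illustration.
    Code:
    * 1 if a country's income group is not constant over the sample period
    * (a country with missing incgroup_th in some years would also be flagged)
    bysort country2 (incgroup_th): generate byte switches = incgroup_th[1] != incgroup_th[_N]
    * tag one observation per country so each country is counted once
    egen byte first = tag(country2)
    tabulate switches if first
    If any country has switches equal to 1, the split-sample regressions demean that country over only part of its years.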


    • #3
      Well, what you explain in #2 may account for some of the difference.

      But even if income groups were unchanging, with fixed effects regressions you cannot expect the results of stratified regressions to match those of a combined regression with interaction. The reason is actually fairly simple. In general, when you add or remove covariates from any kind of regression model, the results can change. They can change drastically: large differences in magnitude, opposite signs, whatever. In fact, you should, in general, expect that they will change when you do that. The results of -reg y x1 x2- and the results of -reg y x1 x3- will, in general, not resemble each other. Now, with -xtreg, fe-, even though the fixed effects are implemented in Stata by de-meaning, the process is completely equivalent to introducing a bunch of covariates: an indicator ("dummy") variable for each country (save one reference category). So if you do -xtreg, fe- on subsets of the full data set that contain different countries, you are changing the covariates in the model! And the results do not need to agree.
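
      To see the equivalence on your own data, a minimal sketch (assuming the data are already -xtset country2 year-, so that the L. operator works) is to compare the following three commands. The coefficients on L.ka and on the year indicators should be identical across them; the reported constant and R-squared can differ because they are defined differently.
      Code:
      * within (demeaning) estimator
      xtreg loggdppc L.ka i.year, fe
      * same model with explicit country indicators
      regress loggdppc L.ka i.year i.country2
      * same model with the country indicators absorbed
      areg loggdppc L.ka i.year, absorb(country2)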

      Even outside the context of fixed effects regression, you might have a simple regression model such as -regress y x- and be interested in whether the effect of x on y differs by, say, the sex of the person. If you run -regress y i.sex##c.x- and compare the results to -regress y x if sex == 0- and -regress y x if sex == 1- (with sex coded 0/1), they will be consistent (if you know how to do it correctly; this is easy to mess up). But if there is another variable in the model, say -regress y x w-, then in general the results of -regress y i.sex##c.x w- do not have to agree with those of separately running -regress y x w if sex == 0- and -regress y x w if sex == 1-. The reason is that the interaction model implicitly constrains the effect of w to be the same in both sexes, whereas the separate models impose no such constraint, and its absence lets w shift the estimates for x differently in each group. In the case of a simple regression, you can get around this problem by running a fuller interaction model, one in which sex is interacted with all of the predictors: -regress y i.sex##(c.x c.w)-. The results of this will agree with the results of separate male and female regressions of y on x and w.
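
      As an illustration with the auto data shipped with Stata (the choice of price, mpg, weight and foreign is mine, purely for demonstration), the fully interacted model reproduces the point estimates of the separate regressions:
      Code:
      sysuse auto, clear
      * fully interacted model: foreign interacted with every predictor
      regress price i.foreign##(c.mpg c.weight)
      * slope of mpg among foreign cars = main effect + interaction
      lincom mpg + 1.foreign#c.mpg
      * separate regressions for comparison
      regress price mpg weight if foreign == 0
      regress price mpg weight if foreign == 1
      The mpg coefficient in the domestic regression should equal the main effect of mpg in the interacted model, and the mpg coefficient in the foreign regression should equal the -lincom- result. The standard errors will generally differ, because the pooled model estimates a single residual variance for both groups.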

      Now, when working with -xtreg, fe-, you cannot do that same trick, because there is no syntax in -xtreg, fe- that allows you to tell Stata to interact incgroup_th with all of the country fixed effects. (This is because, in Stata's implementation of -xtreg, fe-, there aren't actually any country indicators: it's done by demeaning instead, and there is no syntax to "interact" incgroup_th with demeaning.) In this case, you would have to emulate -xtreg, fe- using -regress- with country indicators as covariates. So something like -regress loggdppc (c.L.ka i.year i.country2)##i.incgroup_th- would do that. (Of course, if the number of values of country2 is really large, you won't have enough matrix space to run this, but for just the 100 you have it shouldn't be a problem.) And these results would agree with the results of separate regressions.

      But you have a problem: what you have found is that when you just run an overall model with interaction of incgroup_th with just c.l.ka you get results that are strikingly different from those in the separate regressions. Opposite signs, and the confidence intervals don't even overlap at all. The implication in light of the above is that there are important interactions of incgroup_th with year or with country fixed effects (or both). If there weren't, the results would more or less agree. So these analyses are telling you that a model based only on the limited interaction is mis-specified. Consequently, you should either use separate regressions, or if you combine them into an interaction model, the interaction must include all the model variables and all the countries as well, not just c.l.ka.
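
      One way to probe this, as a sketch rather than a definitive test, is to add the year-by-group interactions to the -xtreg, fe- model and test them jointly:
      Code:
      * let both the L.ka slope and the year effects differ by income group
      xtreg loggdppc (c.L.ka i.year)##i.incgroup_th, fe
      * joint test that all year-by-income-group interaction terms are zero
      testparm i.year#i.incgroup_th
      A small p-value would suggest that the year effects differ across income groups, which a model interacting only L.ka does not allow for.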


      • #4
        Thank you Clyde Schechter for your extremely clear and helpful explanation. I didn't understand that you have to interact the country and year fixed effects with the group indicator to get the same results as splitting the sample, but indeed,
        Code:
        xtreg loggdppc (c.L.ka i.year)##incgroup_th i.country2#incgroup_th, fe
        yields exactly the same coefficients as
        Code:
        xtreg loggdppc L.ka i.year if incgroup_th==1, fe
        xtreg loggdppc L.ka i.year if incgroup_th==2, fe 
        xtreg loggdppc L.ka i.year if incgroup_th==3, fe
        xtreg loggdppc L.ka i.year if incgroup_th==4, fe
        where the first model has the advantage of having smaller standard errors.

        Thank you again!
        Last edited by Wouter Wakker; 24 Apr 2019, 15:28.
