  • Difference in Differences model

    Hi Everyone!

    I am trying to use the difference in differences model where I am struggling at the moment and would really appreciate your help!

    I have read around and I think I have understood how to do a difference -in-differences regression analysis in STATA by looking at your posts on the Statalist forum.

    I am looking the effect of the Euro on trade in Europe. I have a control group and a treatment group ("pairEURO" equal to 1 if they adopted the euro in 1999)

    Code:
    gen TREATMENT = (pairEURO == 1)
    gen POST = (year > 1999)
    gen INTERACTION = TREATMENT*POST
    
    
    xtset CountryPair YEAR
    xtreg lnTradeFlow CommonLanguage lnPopulationSizej lnPopulationSizei lnGDPj lnGDPi lnDistance TREATMENT POST INTERACTION i.year, fe robust
    1) My TREATMENT variable gets omitted as it is constant over time, but my POST variable does not. Does this mean I made a mistake and did not code my data correctly?
    I believe that POST does not get omitted because it is not equal to 1 until the "Policy" is implemented, hence Stata did not omit it, am I right?

    2) It is better to do fixed effects than OLS because it accounts for effects such as "Distance" and "CommonLanguage" which are constant over time, am I correct? (The treatment dummy will also get omitted as it is fixed over time.)

    3) Is this the correct way of proceeding? And I am interested in the "INTERACTION" variable coefficient, which I then transform as e^(Coefficient - 1) * 100 to get the percentage effect?

    I am using STATA 14

    I hope I am clear! Thank you very much in advance!

    Best Regards,

    Joseph

  • #2
    1) My TREATMENT variable gets omitted as it is constant over time, but my POST variable does not. Does this mean I made a mistake and did not code my data correctly?
    I believe that POST does not get omitted because it is not equal to 1 until the "Policy" is implemented, hence Stata did not omit it, am I right?
    It's not a mistake. In a fixed-effects regression, any variable that is constant within panel (CountryPair) will be collinear with the fixed effect and will be dropped automatically. It's not a quirk of Stata either. Any within-panel effects estimator necessarily does this: it is in principle impossible to estimate effects that are constant within panel. This is one of the uncommon situations in which it is sensible to have an interaction term but omit one of its component main effects.

    Remember, too, that if you were able to retain the TREATMENT variable (which you could do in a random-effects model or with OLS), its interpretation in the context of an interaction model is not what it seems. It would not be an effect of the treatment at all. It would be an estimate of the expected difference in lnTradeFlow between the treated and untreated groups before treatment began. The effects of the treatment are all embodied in the contributions of the POST and INTERACTION variables.
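
    Both points can be checked on a small artificial panel. The sketch below uses made-up variable names and effect sizes (nothing in it comes from the real trade data). It shows that the within transformation turns a time-invariant dummy into a column of zeros, and that in a plain OLS interaction model the coefficient on treatment equals the pre-period difference in group means.

    ```stata
    clear
    set seed 2016
    set obs 10
    gen country_pair = _n
    gen byte treatment = (country_pair <= 5)
    expand 10
    bysort country_pair: gen year = 1994 + _n
    gen byte post = (year > 1999)
    // Effect sizes below are arbitrary assumptions for illustration
    gen outcome = 0.2*treatment + 0.2*post + treatment*post + rnormal(0, 0.25)

    // Within transformation: subtract each pair's own mean from treatment
    bysort country_pair: egen mean_treatment = mean(treatment)
    gen treatment_within = treatment - mean_treatment
    summarize treatment_within      // identically zero, so FE has nothing to estimate

    // OLS keeps the main effect, but it is only the pre-period gap between groups
    regress outcome i.treatment##i.post
    summarize outcome if treatment == 1 & post == 0, meanonly
    scalar pre_treated = r(mean)
    summarize outcome if treatment == 0 & post == 0, meanonly
    scalar pre_control = r(mean)
    display _b[1.treatment]
    display pre_treated - pre_control    // matches the coefficient above
    ```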

    It is better to do fixed effects than OLS because it accounts for effects such as "Distance" and "CommonLanguage" which are constant over time, am I correct? (The treatment dummy will also get omitted as it is fixed over time.)
    If it is important to you to obtain estimates of the effects of Distance and CommonLanguage, then you have to abandon the fixed-effects estimator. But if you are not interested in estimating those effects and only want to adjust for them, the fixed-effects estimator is an excellent way to do that, as it also adjusts for unobserved (and even un-thought-of) time-invariant differences among the panels. If you do want to estimate those effects, the simplest modification, I think, would be to use the between-effects estimator (-be- option to -xtreg-).

    The use of OLS on panel data can be problematic. However, if you look at the very last line of your -xtreg- output, you will see an F-test of the hypothesis that all u_i = 0. If you do not reject that hypothesis (and here I mean not just p > 0.05, but p comfortably greater than 0.05 or a very large N of panel vars) then you can safely use OLS. Another way to retain time-invariant variables is to use the random-effects model. In finance and economics it is generally considered de rigueur, if not absolutely mandatory, to do a Hausman test when choosing between the fixed-effects and random-effects estimators.
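
    For reference, a minimal sketch of the usual Hausman workflow on artificial data. The names id, t, x and y are hypothetical, and the regressor is deliberately built to be correlated with the panel effect. As far as I know, -hausman- needs both models fitted with default (non-robust) standard errors, so the robust option is left off here.

    ```stata
    clear
    set seed 2016
    set obs 50
    gen id = _n
    gen u = rnormal()                 // panel-level effect
    expand 8
    bysort id: gen t = _n
    gen x = rnormal() + 0.5*u         // regressor correlated with the panel effect
    gen y = 1 + 2*x + u + rnormal()
    xtset id t

    // Fit and store the consistent (FE) and efficient (RE) estimators
    xtreg y x, fe
    estimates store fe_model
    xtreg y x, re
    estimates store re_model

    // Compare them: a small p-value argues against the RE assumptions
    hausman fe_model re_model
    ```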

    3) Is this the correct way of proceeding? And I am interested in the "INTERACTION" variable coefficient, which I then transform as e^(Coefficient - 1) * 100 to get the percentage effect?
    This approach looks generally reasonable. I can't comment about the content of your model as I have no expertise in finance or economics, but the overall structure as an interaction model to estimate differences in differences is right. As for what you are interested in, the usual focus of interest in these is the coefficient of the interaction term. That estimates the expected difference between the change (post vs pre) in outcome in the treatment group and the change in outcome in the control group. That is the actual difference that the treatment made, over and above whatever difference between the groups may have existed prior to treatment. Since your outcome variable is the log transform of TradeFlow, if you are interested in the percentage change in TradeFlow associated with the treatment effect, the correct formula is 100*(exp(coefficient)-1). That is, first exponentiate the coefficient and then subtract one, not the reverse.
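
    As a quick arithmetic check, suppose (purely hypothetically) that the interaction coefficient were 0.10. The two orders of operation give very different answers. The value 0.10 is an illustrative assumption, not an estimate from this thread.

    ```stata
    // Hypothetical coefficient, chosen only for illustration
    scalar b = 0.10

    display 100*(exp(b) - 1)     // exponentiate, then subtract one: about 10.52
    display 100*exp(b - 1)       // subtract first, as in the question: about 40.66

    // After fitting the model from the question, the same transformation
    // with a delta-method standard error would be (left commented out
    // because it needs the poster's data in memory):
    // nlcom 100*(exp(_b[INTERACTION]) - 1)
    ```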



    • #3
      Thanks a lot for your answer, Clyde!


      I also wanted to ask whether it is a good idea to add year effects (i.year). Will the interaction variable then only capture the year after the introduction of the euro, since it is a dummy variable (treatment*post)?

      I am only interested in the coefficient of the interaction term, hence I believe FE is the right regression. I add year effects to take out unobserved heterogeneity such as the financial crisis, but I am now worried that my interaction term will only capture the year after.



      Thanks a lot!

      Best Regards,

      Joseph



      • #4
        You can add i.year to your model if there is good reason to do so. But be careful how you code it: POST will be collinear with the full set of i.year indicators, so it cannot coexist with all of them in the model. If you list i.year before i.treatment##i.post, Stata will drop post. You will still have treatment#post, which is your most important variable, but you will no longer be able to estimate what happened in the non-treatment country pairs following the onset of the treatment. If, however, you list i.treatment##i.post before i.year, you will lose a second year indicator to resolve the collinearity. Assuming that year is just a nuisance variable whose effects you want to adjust for, this is a better approach. Try running this code to see this for yourself:

        Code:
        clear*
        // CREATE SOME ARTIFICIAL DATA
        set obs 10
        set seed 1234
        gen country_pair = _n
        gen u = rnormal(0, 0.25)
        expand 10
        by country_pair, sort: gen year = 1994+_n
        gen byte treatment = (country_pair <= 5)
        gen byte post = (year > 1999)
        gen xb = 0.2*treatment + 0.2*post + treatment*post
        gen outcome = xb + u + rnormal(0, 0.25)
        xtset country_pair year
        
        // DIFFERENCE IN DIFFERENCES WITHOUT YEAR EFFECTS
        xtreg outcome i.treatment##i.post, fe
        
        // WITH YEAR EFFECTS ADDED AT END
        xtreg outcome i.treatment##i.post i.year, fe
        
        // WITH YEAR EFFECTS ADDED AT BEGINNING
        xtreg outcome i.year i.treatment##i.post, fe

        That said, if you are mainly concerned about adjusting for the effects of the financial crisis, rather than using a set of year indicators, why not just use a dichotomous variable indicating the years during the financial crisis? Or perhaps better still some economic variables that quantify the effects of the financial crisis continuously (perhaps overall GDP growth rates in the countries in each pair). That strikes me as a better way of getting at that (though I am no economist--so consult your colleagues about this.)

        The use of i.year indicators is a broader adjustment that is appropriate if there are large "shocks" to the outcome from year to year that must be adjusted for. But if you can pinpoint more specific influences such as the financial crisis, that seems to yield a more explanatory model. A model with indicators for specific years is inherently incapable of generalizing to other time periods, whereas a model with adjustments for a financial crisis might be generalizable to future periods that also contain a financial crisis.
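
        A sketch of the single-indicator alternative, again on artificial data. The crisis window (2008 to 2009), the size of the crisis shock and all variable names are assumptions made up for this illustration; pick the window that fits your own data.

        ```stata
        clear
        set seed 2016
        set obs 10
        gen country_pair = _n
        gen u = rnormal(0, 0.25)
        expand 16
        bysort country_pair: gen year = 1994 + _n      // 1995 through 2010
        gen byte treatment = (country_pair <= 5)
        gen byte post = (year > 1999)

        // ASSUMPTION: crisis years are 2008-2009, for illustration only
        gen byte crisis = inrange(year, 2008, 2009)

        gen xb = 0.2*treatment + 0.2*post + treatment*post - 0.5*crisis
        gen outcome = xb + u + rnormal(0, 0.25)
        xtset country_pair year

        // One crisis dummy instead of a full set of year indicators;
        // post stays in the model because it is not collinear with crisis
        xtreg outcome i.treatment##i.post i.crisis, fe
        ```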



        • #5
          Hi again Clyde!

          I am a bit confused. When I include my time-invariant variables in my FE regression, I get a different result for the coefficient I am interested in than if I take them out. Is this normal? If so, which one should I use in this case?


          Code:
            
          xtset CountryPair YEAR  
          
          xtreg lnTradeFlow i.year CommonLanguage lnPopulationSizej lnPopulationSizei lnGDPj lnGDPi lnDistance TREATMENT POST INTERACTION, fe robust
          
          xtreg lnTradeFlow i.year lnPopulationSizej lnPopulationSizei lnGDPj lnGDPi INTERACTION, fe robust

          The coefficient of my interaction variable differs between these 2 regressions, although all I removed were the "CommonLanguage", "lnDistance", "TREATMENT" and "POST" variables, which would have been omitted anyway...

          When I keep these variables, I notice that some country pairs get omitted, so I have more observations in the FE regression without these variables. I am not really sure what to do in this case.

          Is it correct to include these omitted variables?

          Thank you so much again for your help!!

          Best Regards,

          Joseph
          Last edited by joseph dover; 01 Mar 2016, 08:30.



          • #6
            and when I keep these variables, I realized that some country-pairs get omitted
            This presumably happens because some observations have missing values for those variables, so including them causes those observations to be excluded from the estimation sample. So, of course, when you change the estimation sample, the results change.

            I don't understand why you want to include the variables CommonLanguage and lnDistance in the model: they are constant within country pair and will automatically be dropped due to collinearity with the country-pair fixed effect. So at best, they will add nothing to your analysis. In this case, due to missing values, they are also eroding your estimation sample. Unless there is something about country pairs for which missingness of these variables signals that they are actually "not in universe" for your research question, omitting these observations is most unhelpful: at the least it decreases statistical power and it may very well introduce bias as well. So I think I would not include those variables.
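
            A small artificial example of the mechanism, with hypothetical variables x and y and a made-up lnDistance that is missing for two of ten pairs. The time-invariant variable is dropped from the FE model either way, but listing it still removes every observation where it is missing.

            ```stata
            clear
            set seed 2016
            set obs 10
            gen country_pair = _n
            gen lnDistance = rnormal(7, 1)
            replace lnDistance = . if country_pair > 8    // two pairs lack the variable
            expand 10
            bysort country_pair: gen year = 1994 + _n
            gen x = rnormal()
            gen y = x + rnormal()
            xtset country_pair year

            // Without the time-invariant variable: all 10 pairs are used
            xtreg y x, fe
            display e(N)        // 100

            // With it: lnDistance is omitted, yet the sample still shrinks
            xtreg y x lnDistance, fe
            display e(N)        // 80

            // How many observations are lost to missing values
            count if missing(lnDistance)
            ```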



            • #7

              On those country-pair constant variables: I echo Clyde on the problem of including country-pair variables: in a fixed-effects model they will be dropped.

              You should see country-pair constant variables dropped in your Stata output for an FE panel model; otherwise you have problems with your data. I've altered some of Clyde's code to add two variables that are country-pair constants, so you can compare with your own analyses. It's possible that your variables are not being dropped because they are not constant within country pair, for example if there are slight changes from year to year. Either way, if the FE regression keeps variables that are supposed to be country-pair constants, something is amiss with those constants.

              You'll see the output with the constants dropped if you run Clyde's syntax with two constants added.

              On panel models that permit your country-pair variables: if you want to model those constants, one option is a multilevel/mixed model. In the artificial example below, I added a regression that permits you to model the country-pair constants alongside the treatment terms. The question is primarily theoretical: are the years a nuisance to be done away with, or are they part of the data structure you are interested in modeling?

              A great resource: Finally, I'd strongly recommend Chapter 8 of Cameron and Trivedi, which covers these models (and others) and provides example Stata syntax. Start at p. 255 and first try the population-averaged (PA) model with unstructured error (xtreg, pa). Then go from there to other models. Run a Hausman test to help select FE vs. RE (e.g., p. 267, hausman FE_model RE_model, sigmamore). There's a lot of material, but it's worth reading.

              Clyde's example, with (1) dropped constants (cons1 cons2) and (2) included constants in a mixed model


              Code:
              clear*
              // CREATE SOME ARTIFICIAL DATA
              set obs 10
              set seed 1234
              gen country_pair = _n
              gen u = rnormal(0, 0.25)
              
              * NB: add your country-level constants
              by country_pair, sort: gen cons1 = runiformint(1,5)
              by country_pair, sort: gen cons2 = runiformint(1,5)
              
              expand 10
              by country_pair, sort: gen year = 1994+_n
              
              
              gen byte treatment = (country_pair <= 5)
              gen byte post = (year > 1999)
              gen xb = 0.2*treatment + 0.2*post + treatment*post
              gen outcome = xb + u + rnormal(0, 0.25)
              xtset country_pair year
              
              
              // WITH YEAR EFFECTS ADDED AT BEGINNING
              // NB: SEE DROPPED CONSTANTS, CONS1 AND CONS2
              xtreg outcome i.year i.treatment##i.post cons1 cons2, fe
              
              // MIXED MODEL TO PERMIT INCLUSION OF CONSTANTS
              xtmixed outcome i.treatment##i.post cons1 cons2 || year:, reml
              Nathan E. Fosse, PhD



              • #8
                Code:
                 
                           |                                                total_asset_cat
                      year |         0          1          2          3          4          5          6          7          8          9 |     Total
                -----------+--------------------------------------------------------------------------------------------------------------+----------
                      2010 |        70         41          9         31          2          4          0          0          1          2 |       160 
                      2011 |       108         32         13          2          5          5          3          2          2          2 |       174 
                      2012 |       100         26         24          8          5         12          5         13          1          1 |       195 
                      2013 |       120         47         26         10          3         14          0          6          5          2 |       233 
                      2014 |       122         48         22         10         22         14          2          5          8          4 |       257 
                      2015 |        96         53         11         17          5          5          1         12          6          2 |       208 
                -----------+--------------------------------------------------------------------------------------------------------------+----------
                     Total |       616        247        105         78         42         54         11         38         23         13 |     1,227
                1




                • #9
                  A difference-in-differences model without year dummies is biased if you have more than two periods. See Wooldridge's NBER Summer Institute lecture. The most robust way to estimate a diff-in-diff is a two-way fixed-effects model with a binary policy indicator (treatment*post). There is no reason to interpret any coefficient in a diff-in-diff model but that one, because it is the only one that is identified given the assumptions of the model.
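
                  A sketch of that specification on artificial data. All names and effect sizes are hypothetical, and clustering the standard errors by country pair is my assumption about a sensible default rather than something stated above.

                  ```stata
                  clear
                  set seed 2016
                  set obs 10
                  gen country_pair = _n
                  gen u = rnormal(0, 0.25)
                  expand 10
                  bysort country_pair: gen year = 1994 + _n
                  gen byte treatment = (country_pair <= 5)
                  gen byte post = (year > 1999)
                  gen byte did = treatment*post        // binary policy indicator
                  gen outcome = 0.2*treatment + 0.2*post + did + u + rnormal(0, 0.25)
                  xtset country_pair year

                  // Pair fixed effects (fe) plus year fixed effects (i.year);
                  // the main effects of treatment and post are absorbed by them
                  xtreg outcome did i.year, fe vce(cluster country_pair)
                  display _b[did]
                  ```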
