
  • Using Difference-in-Differences Analysis

    Dear All,

    I am currently facing challenges regarding the use of DiD. I have longitudinal data with only two waves (2009/2010 and 2014/2015). I am to analyze changes in wealth due to child fostering. The fostering variable is an indicator variable with 1 = foster, 0 = non-foster. I constructed a wealth index using multiple correspondence analysis (MCA).

    Is it appropriate to run a DiD by defining households who received foster children as my treatment group and households who did not as the control group, with my time variable defined as year > 2014 (the follow-up wave)?

    Code:
    gen time= (combdate >= td(05mar2014)) & !missing(year)
    gen treatment = (foster_status==1) & !missing(foster_status)

    My model is specified as follows:
    W_it = B0 + B1*Treatment_i + B2*Time_t + B3*(Treatment_i x Time_t) + e_it

    Your advice would be highly appreciated.
    Regards,
    Stephen

  • #2
    You have the right idea, assuming that the variable foster_status is always 0 or always 1 in all observations of the same household. But the code is not quite correct. It should be:

    Code:
    gen time = (combdate >= td(05mar2014)) if !missing(combdate)
    gen treatment = (foster_status==1) if !missing(foster_status)

    By the way, the code to create the variable treatment can be simplified even further to:
    Code:
    gen treatment = 1.foster_status
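
    One way to verify the assumption above, as a sketch (here hhid is just a placeholder for whatever variable identifies households in your data):
    Code:
    * assert that foster_status is identical across all observations
    * of the same household (hhid = your household identifier)
    bysort hhid (foster_status): assert foster_status[1] == foster_status[_N]
    If the assertion fails, some household changes foster_status between waves and the simple treated/control split is not well defined.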



    • #3
      I am very grateful for the response, Clyde.



      • #4
        Follow-up question
        After running the following code:

        Code:
        areg wealth treated did, a(foster_status) vce(robust)

        where treated = treatment and did = time*treated, following from the previous post.

        The results showed that my did variable was significant, but after adding other variables to the model, it became insignificant.

        Is this a valid result, or am I doing something wrong?

        note: treated omitted because of collinearity

        Linear regression, absorbing indicators         Number of obs     =     35,245
                                                        F(   1,  35242)   =      32.03
                                                        Prob > F          =     0.0000
                                                        R-squared         =     0.0007
                                                        Adj R-squared     =     0.0006
                                                        Root MSE          =     0.7149

        ------------------------------------------------------------------------------
                     |               Robust
              wealth |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
             treated |          0  (omitted)
                 did |  -.3566215   .0630114    -5.66   0.000    -.4801257   -.2331172
               _cons |  -.2865537   .0038511   -74.41   0.000    -.2941019   -.2790054
        -------------+----------------------------------------------------------------
        foster_sta~s |   absorbed                                       (2 categories)



        Code:
        areg wealth did treated c.child_age##c.child_age numchild i.gender c.age##c.age i.marital_status i.educqual i.region i.urbrur avghhsize_adj i.hhinc i.time, a(foster_status) vce(robust)
        note: treated omitted because of collinearity

        Linear regression, absorbing indicators         Number of obs     =     31,341
                                                        F(  20,  31319)   =     108.11
                                                        Prob > F          =     0.0000
                                                        R-squared         =     0.1286
                                                        Adj R-squared     =     0.1280
                                                        Root MSE          =     0.6414

        -----------------------------------------------------------------------------------
                          |               Robust
                   wealth |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        ------------------+----------------------------------------------------------------
                      did |  -.0333492   .0609341    -0.55   0.584    -.1527824    .0860841
                  treated |          0  (omitted)
               child_age1 |   .0416836   .0049415     8.44   0.000     .0319979    .0513692
                          |
             c.child_age1#|
             c.child_age1 |  -.0020881   .0002823    -7.40   0.000    -.0026414   -.0015348
                          |
                 numchild |   .0479884   .0024866    19.30   0.000     .0431146    .0528621
                          |
                   gender |
                5. Female |  -.0181797   .0073199    -2.48   0.013     -.032527   -.0038323
                      age |   .0029435   .0007537     3.91   0.000     .0014662    .0044208
                          |
              c.age#c.age |  -.0000313   8.61e-06    -3.63   0.000    -.0000481   -.0000144
                          |
           marital_status |
               1. married |  -.0544142   .0124341    -4.38   0.000    -.0787855   -.0300428
         2. Divorced/se.. |  -.1742315   .0252655    -6.90   0.000    -.2237529     -.12471
                          |
                 educqual |
                 2. Basic |   .0558607   .0115662     4.83   0.000     .0331905    .0785309
             3. Secondary |  -.0214995   .0174891    -1.23   0.219    -.0557789    .0127799
         4. Post-Second.. |   .0787309   .0188486     4.18   0.000      .041787    .1156748
                 5. Other |   .0667422   .0099745     6.69   0.000     .0471919    .0862926
                          |
                   region |
                 1. North |  -.1965014   .0183841   -10.69   0.000    -.2325349   -.1604679
                          |
                   urbrur |
                 5. Rural |  -.0443188   .0091567    -4.84   0.000    -.0622662   -.0263714
            avghhsize_adj |    -.06198   .0020363   -30.44   0.000    -.0659713   -.0579887
                          |
                    hhinc |
               low_income |  -.0070525   .0123708    -0.57   0.569    -.0312998    .0171949
            middle_income |    .159465    .020203     7.89   0.000     .1198664    .1990636
              high_income |   .1238394   .0201587     6.14   0.000     .0843276    .1633512
                          |
                   1.time |  -.2914502   .0065949   -44.19   0.000    -.3043765   -.2785239
                    _cons |  -.0371366   .0231017    -1.61   0.108    -.0824168    .0081436
        ------------------+----------------------------------------------------------------
            foster_status |   absorbed                                      (2 categories)



        • #5
          There are several things going on here. First, you should never be surprised when a coefficient changes after you change a model. If that weren't the case, you wouldn't be able to fix omitted-variable bias by including the omitted variable, right? Also, always remember that the difference between statistically significant and not statistically significant is not, itself, statistically significant. So the real issue here is how much the interaction coefficient changed, and for what specific reason.

          In this case, the coefficient of did changed from about -.36 to -.03, which is a pretty large change. So we might wonder which variable, or combination of variables, added to the model accounts for that.

          But before we sink a lot of energy into that, let's look more carefully at whether we did the analysis correctly in the first place. You did this analysis absorbing foster_status. But that's not right. You have longitudinal data, so what you need to absorb here is the variable that identifies individuals (or households, or whatever each observation represents) in your study. That will give you a different value of the did coefficient (with or without the other covariates). So first let's fix that error and then we can see what the new versions of the did coefficient are and perhaps look into the source of the change further.



          • #6
            Great, thank you very much, Clyde. I will work on these and revert.



            • #7

              FPrimary is the ID for households in my data. When I absorb it, this is what I get:

              Code:
              areg wealth treated did, a(FPrimary) vce(robust)
              
              Linear regression, absorbing indicators         Number of obs     =     35,245
                                                              F(   0,  29826)   =          .
                                                              Prob > F          =          .
                                                              R-squared         =     1.0000
                                                              Adj R-squared     =     1.0000
                                                              Root MSE          =     0.0000
              
              ------------------------------------------------------------------------------
                           |               Robust
                    wealth |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
              -------------+----------------------------------------------------------------
                   treated |          0  (omitted)
                       did |          0  (omitted)
                     _cons |  -.2882232          .        .       .            .           .
              -------------+----------------------------------------------------------------
                  FPrimary |   absorbed                                    (5417 categories)

              Also, I just read at https://www.stata.com/manuals13/rareg.pdf that "absorb(varname) specifies the categorical variable, which is to be included in the regression as if it were specified by dummy variables."

              What must I do to resolve this?
              Thank you.



              • #8
                Something is wrong with the way you have coded treated or did, and the time variable also needs to be in the model.

                It is expected that treated will be omitted, because it is constant within FPrimary. But did should not be constant within FPrimary: at least for the treated group it should be 0 before treatment and 1 afterward. The fact that it, too, was dropped says that you have that wrong.

                In any case, hand-coding a did variable is just an invitation to make a mistake. First check that your treated and time variables are correct. Then do it as:
                Code:
                areg wealth i.treated##i.time, absorb(FPrimary) vce(cluster FPrimary)
                and you should be fine.

                The variable treated will be omitted, but time and the treated#time interaction will not be.
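
                The "check that your treated and time variables are correct" step can be sketched like this (a sketch that assumes every household is observed in both waves):
                Code:
                * each household should contribute a time==0 and a time==1 observation
                bysort FPrimary (time): assert time[1] == 0 & time[_N] == 1
                * a hand-coded did must equal the product of the two indicators
                assert did == treated * time
                If either assertion fails, the did variable (or the underlying time or treated variable) is miscoded, which would explain why it was dropped.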



                • #9
                  Thanks, Clyde. I am really grateful for your prompt responses.

                  I was doing everything wrong from the beginning: my data has 35,245 individual observations and 5,417 households. Since I am doing a household-level analysis, I have collapsed my data to the household level.

                  Meanwhile, when I tried the -areg- approach:

                  Code:
                  areg wealth i.treated##i.time, areg(FPrimary) vce(cluster FPrimary)
                  I kept getting the error r(198): "groupvar() required" (presumably because areg() is not a valid option name here; the option should be absorb(FPrimary)).

                  To get around this, I used this code and it worked:
                  
                   reg wealth i.time##i.treated, vce(robust)
                  
                  Linear regression                               Number of obs     =      5,417
                                                                  F(3, 5413)        =     190.60
                                                                  Prob > F          =     0.0000
                                                                  R-squared         =     0.0086
                                                                  Root MSE          =     .96818
                  
                  ------------------------------------------------------------------------------
                               |               Robust
                        wealth |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                  -------------+----------------------------------------------------------------
                        1.time |  -.3421045   .0144504   -23.67   0.000    -.3704332   -.3137759
                     1.treated |  -.0479491   .0885624    -0.54   0.588    -.2215669    .1256688
                               |
                  time#treated |
                          1 1  |   .0471244   .0888942     0.53   0.596    -.1271439    .2213928
                               |
                         _cons |  -.0876378   .0144169    -6.08   0.000    -.1159007   -.0593749
                  ------------------------------------------------------------------------------
                  
                  As I then include other variables of interest, I obtain this:

                  Code:
                  reg wealth i.time##i.treated c.child_age##c.child_age numchild i.gender c.age##c.age i.marital_status i.educqual i.region i.urbrur avghhsize_adj i.hhinc, vce(robust)
                  
                  Linear regression                               Number of obs     =      5,417
                                                                  F(21, 5395)       =      34.77
                                                                  Prob > F          =     0.0000
                                                                  R-squared         =     0.1859
                                                                  Root MSE          =      .8788
                  
                  -----------------------------------------------------------------------------------
                                    |               Robust
                             wealth |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                  ------------------+----------------------------------------------------------------
                             1.time |  -.6091334   .0446078   -13.66   0.000    -.6965827   -.5216841
                          1.treated |   .0764924    .080838     0.95   0.344    -.0819827    .2349674
                                    |
                       time#treated |
                               1 1  |  -.1567473   .0901331    -1.74   0.082    -.3334445    .0199499
                                    |
                          child_age |  -.0102547   .0111167    -0.92   0.356    -.0320479    .0115386
                                    |
                        c.child_age#|
                        c.child_age |   .0012477   .0006483     1.92   0.054    -.0000232    .0025187
                                    |
                           numchild |   .2211294   .0137352    16.10   0.000     .1942028    .2480561
                                    |
                             gender |
                            Female  |  -.2098426   .0324123    -6.47   0.000    -.2733838   -.1463013
                                age |    .017957   .0046023     3.90   0.000     .0089345    .0269794
                                    |
                        c.age#c.age |  -.0000842   .0000449    -1.87   0.061    -.0001722    3.86e-06
                                    |
                     marital_status |
                           married  |    .141425   .0366668     3.86   0.000     .0695432    .2133068
                  Divorced/separ~d  |   .0184881   .0479958     0.39   0.700    -.0756031    .1125794
                                    |
                           educqual |
                             Basic  |   .0202279   .0349925     0.58   0.563    -.0483716    .0888274
                         Secondary  |  -.0725679   .0473287    -1.53   0.125    -.1653514    .0202155
                    Post-Secondary  |   .0756635   .0557805     1.36   0.175    -.0336888    .1850158
                             Other  |   .0558392   .0350746     1.59   0.111    -.0129212    .1245995
                                    |
                             region |
                             North  |  -.0903384    .039311    -2.30   0.022    -.1674039   -.0132729
                                    |
                             urbrur |
                             Rural  |  -.1103094   .0274132    -4.02   0.000    -.1640503   -.0565685
                      avghhsize_adj |   -.220579   .0117509   -18.77   0.000    -.2436156   -.1975424
                                    |
                              hhinc |
                        low_income  |   .1958051   .0388891     5.03   0.000     .1195668    .2720434
                     middle_income  |   .0754736   .0420423     1.80   0.073    -.0069462    .1578933
                       high_income  |   .1168129   .0515738     2.26   0.024     .0157075    .2179183
                                    |
                              _cons |  -.1972036   .1134548    -1.74   0.082    -.4196208    .0252137



                  • #10
                    So, you have a lot of covariates there, and some of them have effects that are pretty large (relative to the interaction coefficient). Again, I want to remind you that when you add or remove variables and change a model, everything is up for grabs and the variables that are in both models can look very different. There is nothing wrong with that, and as I pointed out in #5, it is what makes it possible to deal with omitted variable bias.

                    So there isn't any real reason you need to pursue this. But if you are curious which covariate(s) are leading to the change in the interaction term, you can just try re-running the model several times, each time omitting one of the covariates from the full model, and see what happens to the interaction coefficient in each case. It may turn out, by the way, that no one covariate on its own is largely responsible--it might be some combination of them, but chasing down which combination would be an enormous amount of work.
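
                    That leave-one-out procedure can be sketched with a loop like the following (the covariate list is copied from the model in #9; adjust it to your actual specification):
                    Code:
                    * re-run the full model, dropping one covariate term at a time,
                    * and display the did (interaction) coefficient each time
                    local covs c.child_age##c.child_age numchild i.gender c.age##c.age ///
                        i.marital_status i.educqual i.region i.urbrur avghhsize_adj i.hhinc
                    foreach v of local covs {
                        local dropped `v'
                        local rest : list covs - dropped
                        quietly regress wealth i.time##i.treated `rest', vce(robust)
                        display "dropping `v': interaction coef = " _b[1.time#1.treated]
                    }
                    Note that each factor-variable term (e.g. i.educqual) is dropped as a whole block, which is usually what you want when asking which covariate drives the change.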

                    In addition to thinking about the changes resulting from adding or removing covariates as being about omitted-variable bias, you can also think of this as an example of Simpson's paradox (sometimes called Lord's paradox in the context of regression). There's a very readable explainer of this on Wikipedia, and it may help you decide whether it is the adjusted or the unadjusted model that properly answers your research question.



                    • #11
                      Alright, Clyde. Thank you once again; you have been really helpful. I will look at what you suggested on the "curiosity bit".



                      • #12
                        Please, how do I interpret the coefficients of my full DiD model? Do I use the ordinary interpretation of an OLS model?
                        Also, what about these other effects, obtained using -margins-?

                        Code:
                         margins, dydx(time) at(treated=(0 1))
                        
                        Conditional marginal effects                    Number of obs     =      5,417
                        Model VCE    : Robust
                        
                        Expression   : Linear prediction, predict()
                        dy/dx w.r.t. : 1.time
                        
                        1._at        : treated         =           0
                        
                        2._at        : treated         =           1
                        
                        ------------------------------------------------------------------------------
                                     |            Delta-method
                                     |      dy/dx   Std. Err.      t    P>|t|     [95% Conf. Interval]
                        -------------+----------------------------------------------------------------
                        1.time       |
                                 _at |
                                  1  |  -.3421045   .0144504   -23.67   0.000    -.3704332   -.3137759
                                  2  |  -.2949801   .0877118    -3.36   0.001    -.4669305   -.1230297
                        ------------------------------------------------------------------------------
                        Note: dy/dx for factor levels is the discrete change from the base level.
                        
                        
                         margins time#treated
                        
                        Adjusted predictions                            Number of obs     =      5,417
                        Model VCE    : Robust
                        
                        Expression   : Linear prediction, predict()
                        
                        ------------------------------------------------------------------------------
                                     |            Delta-method
                                     |     Margin   Std. Err.      t    P>|t|     [95% Conf. Interval]
                        -------------+----------------------------------------------------------------
                        time#treated |
                                0 0  |  -.0876378   .0144169    -6.08   0.000    -.1159007   -.0593749
                                0 1  |  -.1355869    .087381    -1.55   0.121    -.3068889    .0357151
                                1 0  |  -.4297424   .0009838  -436.81   0.000    -.4316711   -.4278137
                                1 1  |   -.430567   .0076104   -56.58   0.000    -.4454865   -.4156475
                        ------------------------------------------------------------------------------


                        Attached: Graph.gph



                        • #13
                          The interpretation of coefficients of non-interacted variables is the same as you are accustomed to. For variables that participate in interactions, it is different. The key thing to remember is that if you have a treated#time interaction in the model, then the coefficient of treated is no longer "the effect of treated" and the coefficient of time is no longer "the effect of time." In fact, in such a model, there is no such thing as "the effect of" either treatment or time. Rather, there are two such effects for each of those variables: one for when the other variable is 0 and another for when the other variable is 1. You can calculate them from the regression output, but it's easier to get them from -margins-. In the output you show in #12, you can see that when treated = 0, the expected difference in outcome between time = 0 and time = 1 is about -.34, whereas when treated = 1, the expected change with time is about -.29. The second table of -margins- output you show tells you what the expected values of the outcome variable are for each combination of time and treated.
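
                          For instance, "calculate them from the regression output" is just adding coefficients; using the estimates from the first model in #12:
                          Code:
                          * effect of time in the treated group = time coefficient
                          * plus the interaction coefficient
                          display -.3421045 + .0471244
                          This gives -.2949801, matching the dy/dx at treated = 1 in the -margins- table.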

                          Note, by the way, that instead of using the -at()- option in your first -margins- command, you could have written -margins treated, dydx(time)-. The results would have been the same, but instead of being labeled with values of _at, which you then have to cross-reference to the output above the table, they would have been labeled with the actual values of treated. Your results suggest that at time 0 the two groups start out with slightly different average outcome values, but by time 1 they converge to essentially the same value (though that value is much more negative in both cases). Of course, we can't make too much of that, because the difference between the outcomes at time 0 is pretty small and the confidence intervals substantially overlap. So a more conservative reading would be simply that both groups show a marked decrease in the expected outcome over time, and they differ very little at either time.
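
                          Concretely, a sketch of those -margins- calls (the -pwcompare- variant is my addition and worth verifying against your version of Stata):
                          Code:
                          * same results as -margins, dydx(time) at(treated=(0 1))-,
                          * but rows labeled by the values of treated
                          margins treated, dydx(time)
                          * -pwcompare- should additionally report the difference between
                          * the two marginal effects (the DiD itself) with its std. err.
                          margins treated, dydx(time) pwcompare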



                          • #14
                            Thank you so much for this very clear explanation, Clyde.
