How to use difference in difference with many yearly observations and dummy-variables?

Susanne Daae

Join Date: Mar 2015
Posts: 22

09 May 2015, 07:29

Hi!

If anyone could help me with something, I would be very very grateful! =)

I have read around and I think I have understood how to do a difference-in-difference regression analysis in Stata. However, I am not sure how to analyze many different yearly observations. Here is my case:

I have about 500 observations with 10 different treatments. I have split the data into clusters to do individual analyses first; then I will attempt the analysis on my entire dataset. I have 10 years of data, with calculated return on assets (ROA) for the 500 companies. What I wish to analyze is the effect of direct airline routes between clusters of companies. Therefore I have 10 treated routes and 51 untreated clusters. For example: suppose I have 3 areas, A, B, and C, and there is a direct route between A and B, but not between A and C. The companies in A with daughter companies in B are the treated companies, while the companies in A that have a daughter company in C are the control group.

I have a dummy variable that is 1 if the company is treated and 0 if it is not. I also have dummy variables stating on which route it is treated.

I know that a normal DID-analysis looks like this:

gen treatment = route_1 == 1
gen after_reform = year > 2005
gen interaction = treatment*after_reform
regr ???? treatment after_reform interaction

My problem now is: how do I incorporate the 10 yearly ROAs (ROA03, ROA04, ROA05, etc.)? That is what I wish to analyze: the difference in the ROAs over the 10 years in the two groups, before and after a treatment.

A small part of my data set looks like this:

treated	year_treated	route_1	route_2	ROA03	ROA04	ROA05	ROA06	ROA07	ROA08	ROA09	ROA10
1	2008	1					.1051213	.1677096	.0880811	.1671309	.0701199
1	2008		1	.1321278	.1501537	.1646822	.0630993	.032966	-.1130877	-.0459326	-.0215173
1	2008	1				.4529801	.2062659	-.361311	.4088252	.2886758
0	0			.0160428	-.0166116	-.0071846	.0055485	.0318236	.0069174	.0270729	.0299224
0	0			.0879779	.1574574
0	0			.1601925	-.0806988	-.039929	.1219937	-.1583	-.0178174	.2113627	.0506658
0	0			-.1377799	-.3151261
0	0						-.0454545	-.2137767	-.6512969	-1.080745	.2529833

Here I wish to compare the ROAs of observations 1 and 3 (route 1) with those of observations 4, 6 and 8, as they are active in the years before and after 2008.
I’m pretty sure that if I wanted to just compare ROA09 I could write:

gen treatment = route_1 == 1
gen after_reform = 1 if ROA09 != .
gen interaction = treatment*after_reform
regr ROA09 treatment after_reform interaction

But how do I get all of the ROAs of the relevant years to compare in the regression?

Is there a problem if there is a different number of observations in the control and treated groups? For example, if firms in the control group go bankrupt more often toward the end of the dataset, there are more observations in the treated group than in the control group.

After I have done this on all the 10 individual routes and control groups, I wish to do it on a national level. My problem here is that the companies are treated in different years, and therefore should be compared to control groups in those years. How do I do this? Or is it enough to have the dummy-variable stating that it is treated?

Sorry for the way too long post, I wanted to explain as much as possible so you understand my problem (hopefully) =)

Best regards,
Susanne Daae

Jeff Wooldridge

Join Date: Apr 2014

Posts: 2081
#2

09 May 2015, 09:44

Susanne: If I understand correctly, you're making this too hard. You should not have all of those definitions of the dependent variable. There should be only one -- call it ROA. You should have this defined for every firm in every year. You need a line of data for each firm in each year. Then, you need the ROA variable and then dummy indicators for each of the 10 treatments. These should be set to one if a firm received that treatment in that specific year, and zero otherwise. But now that I look at it, I'm not sure you have 10 different treatments. It looks like you have two, route_1 and route_2. You do not consider it a separate treatment because it happened in different years.

Your Stata commands should look something like this.

Code:

xtset firmid year
xtreg ROA route_1 route_2 i.year, fe cluster(firmid)

This controls for firm fixed effects and year fixed effects. route_1 and route_2 should be defined so that they are unity when the treatment is in effect.
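Getting from one row per firm to one row per firm-year can be sketched as follows. This is only an illustration under assumptions: the three firms and their ROA values are made up, `firmid` follows the name used in the command above, and `yy`, `year` and `treat_1` are hypothetical names for the year suffix, the calendar year and a dummy that is one while the route is in effect.

```stata
* Hypothetical wide data: one row per firm, one ROA column per year
clear
input firmid route_1 year_treated ROA07 ROA08 ROA09
1 1 2008 .10 .17 .09
2 0 0    .03 .01 .03
3 1 2008 .   .45 .21
end

* Wide to long; the string option keeps the two-digit suffixes "07", "08", "09"
reshape long ROA, i(firmid) j(yy) string
gen year = 2000 + real(yy)
drop yy

* Drop firm-years without an ROA (the panel may be unbalanced)
drop if missing(ROA)

* Treatment dummy: one from the year the route opens onward, zero otherwise
gen treat_1 = route_1 == 1 & year >= year_treated

* Declare the panel structure and inspect the result
xtset firmid year
list firmid year ROA treat_1, sepby(firmid)
```

With the data in this shape, a dummy built like `treat_1` is the kind of variable that would take the place of route_1 in the xtreg command above.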

I hope this helps. I have similar examples in my book, "Econometric Analysis of Cross Section and Panel Data," MIT Press, 2010.

Susanne Daae

Join Date: Mar 2015

Posts: 22
#3

09 May 2015, 10:24

Thank you so much! I will try this and make a separate line for each year's ROA. I do have 10 different routes between 10 different locations; I just didn't include all of them in what I pasted here. The 2 routes I pasted just happened to open in the same year; many of them open in different years. I also have 51 control routes that do not have a direct route between them.

Should the route dummies only be set to 1 in the year the firm received treatment, or in the following years as well, since the firm is then treated (I assume the opening of a direct route has benefits in all years after a treatment)? You said that you do not consider it a separate treatment because it happened in different years, but isn't it a different treatment when it involves completely different firms in different locations?

In the location-specific analysis I want to compare only firms in one area with a direct route, to firms in the same area with a daughter company without a direct route, to look at economic effects on that area. In the national analysis I want to compare all treated firms to non-treated firms within the same years.

The code you wrote, is that for the DID-regression? Don't I need a dummy with

interaction = (route_1*ROA) + (route_2*ROA).....++ ?

and then do the regression

regr ROA route_1 route_2 route_3... interaction

Is that correct?

And then look at the interaction coefficient in the regression?

=)

Jeff Wooldridge

Join Date: Apr 2014

Posts: 2081
#4

09 May 2015, 12:45

Okay, so you have 10 different treatments. A treatment gets turned on (is a one) for any period in which a firm is subject to it. In many cases the exposure starts in a year and then stays for the rest of the sample. In this case, the dummy should be turned on for the remaining periods.

You definitely do not want to create the interactions above. You'd be regressing ROA on functions of ROA, and this is meaningless. Have you looked at a basic DID description with multiple years?

You use ROA as your dependent variable, firm fixed effects, year fixed effects, and the 10 treatment dummies, defined to be one in any period where the route is in effect. That's it. The "interactions" that you seem confused about are taken care of in the definitions of treat_1, treat_2, and so on.

One can make it more complicated by using individual-specific trends, and different treatment effects for different numbers of years, but the basic analysis is often enough.

Susanne Daae

Join Date: Mar 2015

Posts: 22
#5

09 May 2015, 18:16

Ok thank you so much! I will try this tomorrow!

No, I have not done a DID analysis before, so I'm trying to learn it and do it in Stata. I have tried to practice on a sample assignment analyzing unconditional difference-in-difference estimates of the effect of the 1993 EITC expansion on employment of single women. That code looked like this:

gen anykids = (children >= 1)
gen post93 = (year >= 1994)
gen interaction = post93*anykids
reg work post93 anykids interaction

As this used 2 dummy variables and the interaction variable, I tried to go from there. I'm pretty new to Stata so I'm learning as I go =) But thank you so much, I will try this tomorrow =)

Jeff Wooldridge

Join Date: Apr 2014

Posts: 2081
#6

09 May 2015, 20:34

Okay, but notice that all the interaction is doing is creating a new dummy variable. It is one for a woman who has at least one child post 1993. And I strongly suspect that was not true panel data but different cross sections, so it is impossible to include individual fixed effects. If the analysis was for more than two years, I would've included a full set of year dummies -- not just a post93 dummy.

Susanne Daae

Join Date: Mar 2015

Posts: 22
#7

10 May 2015, 06:49

Ah I think I managed to do it now! I think that like you said, the companies are either treated or not, just at different time periods, not 10 different treatments. I used

expand 10 to get one line per year (10 yrs data set)
gen year, added 2003-2012 to each company
gen ROA = ROA12 if year == 2012
replace ROA = ROA11 if year == 2011
.
.
.
replace ROA = ROA03 if year == 2003

drop if ROA == .
gen treated = 1 if year >= year_treated & sum_routes == 1   // sum_routes is 1 if the company is on a treated route
replace treated = 0 if treated == .

reg ROA treated

Is this pretty much what you meant I should do? =) Thank you so much for all your help, I am a mile further than I was yesterday!

Susanne Daae

Join Date: Mar 2015

Posts: 22
#8

10 May 2015, 06:54

By the way, what makes this a difference-in-difference analysis? And not just a normal regression?

Andrew Musau

Join Date: Oct 2014

Posts: 9958
#9

10 May 2015, 11:29

Susanne: I cannot comment on how to carry forward your analysis since I do not fully comprehend what you want to do. However, it is always convenient to have your data in long form. A general comment: the fixed effects approach suggested by Jeff gives similar results to DID and is much simpler to implement. For an intuition, look at the following link (page 12 onwards)

http://www2.warwick.ac.uk/fac/soc/ec...re_3_-_did.pdf

If you insist on a DID approach, I would recommend that you look at one of the more interesting papers that I have read lately: "How responsive is investment in schooling to changes in redistributive policies and in returns?" (Abramitzky and Lavy, Econometrica 82(4), 2014, 1241-1272). Here is the link to the paper and the Stata do files and data (which are free to download from the Econometrica website).

http://web.stanford.edu/~ranabr/Abramitzky_Lavy.pdf
https://www.econometricsociety.org/p...redistributive

You should look at how they organize their data (browse the data editor), and how they implement the DID model specified in their paper, then adapt this to your situation. The relevant do file is the one relating to Table 3. Of course, all this requires that you invest quite a bit of time to understand. If you just have a minor project, then fixed effects is an option and there are plenty of examples around.

Susanne Daae

Join Date: Mar 2015

Posts: 22
#10

10 May 2015, 13:16

Thank you so much for additional insight! What I am trying to do in essence is:

Suppose you have 3 areas, A, B and C, and A and B get a direct flight between them. Suppose there are mother companies in A that have a daughter company in either B or C. What I want to analyze is the effect of additional transportation opportunities, and through that monitoring opportunities, on the mother companies in A.
The companies on either end of route_1 = AB are the treated firms, and the companies on either end of route_2 = AC are the control group.

I assume the companies before the treatment are similar, and want to analyze if there is a difference after one group is treated.
Of course I have many routes and control groups and effects I want to analyze (it is a big project with 3 500 000 observations), but that is the essence of what I want to analyze and am trying to learn how to do. I have worked in Stata for about 2 months so I am still learning the language. I have been told that a DID-analysis is what I have to do.

I had looked at the presentation you sent me a link to already Andrew, but I will check out the paper as well!

I understand the logic and intuition behind the DID explained on page 16, but on page 17 of the presentation it says:

The typical regression model that we estimate is:
Outcome_it = β1 + β2 Treat_i + β3 Post_t + β4 (Treat * Post)_it + ε
Treatment = a dummy if the observation is in the treatment group
Post = post_treatment dummy

This is what I talked about earlier that confuses me: this is a regression with 2 dummies, and a third part that is just a dummy x dummy. I do not know how to incorporate the dependent variable that I wish to do the regression on, or how to tell Stata what to do. I will look at the paper you posted, maybe that helps =)

I did a simple regression with ROA and the treatment dummy, and got promising results. However, I need to do a DID…

Andrew Musau

Join Date: Oct 2014

Posts: 9958
#11

10 May 2015, 17:31

Susanne: On page 16, the author is giving you the intuition of DID - and on page 17, he is actually showing you how this is implemented in a regression framework. You have a very well explained example on page 18. In the DID regression, you will always have the two dummies, and the DID estimate is the coefficient of the interaction term.
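As a minimal sketch of how the equation on page 17 maps onto a Stata command, the dependent variable simply goes first after `regress`, followed by the dummies. The toy data below are made up (two groups, two periods, two observations per cell), and `y`, `treat` and `post` are hypothetical names; the factor-variable operator `##` is used here only as one convenient way to get both dummies and their product without generating the interaction by hand.

```stata
* Hypothetical toy data: 2 groups x 2 periods, two observations per cell
clear
input treat post y
0 0 1
0 0 3
0 1 2
0 1 4
1 0 2
1 0 4
1 1 6
1 1 8
end

* i.treat##i.post expands to both main effects plus their interaction
regress y i.treat##i.post

* The DID estimate is the coefficient on the interaction term
display _b[1.treat#1.post]
```

For these toy numbers the cell means are 2, 3, 3 and 7, so the interaction coefficient should equal (7 - 3) - (3 - 2) = 3, up to rounding.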

I am sure if you go through the paper and the do-files, the procedure will become clear to you: how to arrange the data and generate the indicator variables. Just post if you run into trouble.

Andrew Musau

Join Date: Oct 2014
Posts: 9958

#12

10 May 2015, 22:17

Here is an illustration. Hopefully, you will be able to build on this. For simplicity, I suppose that there are 10 firms and initially, in the year 1998, there is a direct route between A and C for all firms (so that the firms are observationally equivalent in the pre-period). I identify these firms with id=1 through id=10. Then, suppose that in 1999, there is a new route from A to B, but only for the firms with id=1 through id=5. Suppose that I want to use the DID estimator to investigate whether the introduction of the new route ("the treatment") had an effect on the ROA of the firms. Firms in the control group (id=6 through id=10) still continue with the route A to C in the post period. Again, for simplicity, I assume my pre-period is 1998 and my post-period is 2000, but you can include multiple years.

Code:

input ROA id  str8 route  year  
1.775622   1   "A TO C"   1998
3.83331    2   "A TO C"   1998
-5.210526  3   "A TO C"   1998
1.478725   4   "A TO C"   1998
-13.73461  5   "A TO C"   1998  
-8.754751  6   "A TO C"   1998
-2.822808  7   "A TO C"   1998
-.3456052  8   "A TO C"   1998
4.453937   9   "A TO C"   1998
-6.672886  10  "A TO C"   1998    
9.433187   1   "A TO B"   2000
8.438064   2   "A TO B"   2000
9.211845   3   "A TO B"   2000
8.478987   4   "A TO B"   2000
17.824034   5   "A TO B"   2000
1.470532   6   "A TO C"   2000
-8.817384  7   "A TO C"   2000
2.556952   8   "A TO C"   2000
-4.566959  9   "A TO C"   2000
3.711421   10  "A TO C"   2000
end

First, I will create the dummies for treatment, post-period, and the interaction term

Code:

*generate a treatment dummy: =1 if firm is in the treatment group and 0  otherwise
gen treatment= (id<6)

*generate a post-period dummy: =1 if year occurs after the treatment date and 0 otherwise
gen post= (year==2000)

*generate interaction: = 1 if observation is of a firm in the treatment group and occurs in the post period and 0 otherwise
gen interaction= treatment*post

Here is the result

Code:

. list

     +-------------------------------------------------------------+
     |       ROA   id    route   year   treatm~t   post   intera~n |
     |-------------------------------------------------------------|
  1. |  1.775622    1   A TO C   1998          1      0          0 |
  2. |   3.83331    2   A TO C   1998          1      0          0 |
  3. | -5.210526    3   A TO C   1998          1      0          0 |
  4. |  1.478725    4   A TO C   1998          1      0          0 |
  5. | -13.73461    5   A TO C   1998          1      0          0 |
     |-------------------------------------------------------------|
  6. | -8.754751    6   A TO C   1998          0      0          0 |
  7. | -2.822808    7   A TO C   1998          0      0          0 |
  8. | -.3456052    8   A TO C   1998          0      0          0 |
  9. |  4.453937    9   A TO C   1998          0      0          0 |
 10. | -6.672886   10   A TO C   1998          0      0          0 |
     |-------------------------------------------------------------|
 11. |  9.433187    1   A TO B   2000          1      1          1 |
 12. |  8.438064    2   A TO B   2000          1      1          1 |
 13. |  9.211845    3   A TO B   2000          1      1          1 |
 14. |  8.478987    4   A TO B   2000          1      1          1 |
 15. |  17.82403    5   A TO B   2000          1      1          1 |
     |-------------------------------------------------------------|
 16. |  1.470532    6   A TO C   2000          0      1          0 |
 17. | -8.817384    7   A TO C   2000          0      1          0 |
 18. |  2.556952    8   A TO C   2000          0      1          0 |
 19. | -4.566959    9   A TO C   2000          0      1          0 |
 20. |  3.711421   10   A TO C   2000          0      1          0 |
     +-------------------------------------------------------------+

Simple DID is just OLS with the interaction term

Code:


. reg  ROA treatment post interaction

      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  3,    16) =    6.67
       Model |  620.875763     3  206.958588           Prob > F      =  0.0039
    Residual |  496.123522    16  31.0077201           R-squared     =  0.5558
-------------+------------------------------           Adj R-squared =  0.4726
       Total |  1116.99928    19   58.789436           Root MSE      =  5.5685

------------------------------------------------------------------------------
         ROA |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   treatment |   .4569269   3.521802     0.13   0.898    -7.008959    7.922813
        post |   1.699335   3.521802     0.48   0.636    -5.766551    9.165221
 interaction |   11.34938    4.98058     2.28   0.037     .7910261    21.90774
       _cons |  -2.828423    2.49029    -1.14   0.273    -8.107602    2.450756
------------------------------------------------------------------------------

The coefficient of interest is the one on interaction. As you can see, the interaction is significant at the 5% level. It can be easily shown that the interaction coefficient is the DID estimate [(mean ROA treated post - mean ROA treated pre) - (mean ROA not treated post - mean ROA not treated pre)]

Code:

*sum ROA treated post
. sum ROA if  treatment==1 &  post==1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         ROA |         5    10.67722    4.019264   8.438064   17.82403

*sum ROA treated pre

. sum ROA if  treatment==1 &  post==0

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         ROA |         5   -2.371496     7.20595  -13.73461    3.83331


*sum ROA not treated post
. sum ROA if  treatment==0 &  post==1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         ROA |         5   -1.129088    5.355004  -8.817384   3.711421


*sum ROA not treated pre

. sum ROA if  treatment==0 &  post==0

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         ROA |         5   -2.828423     5.22251  -8.754751   4.453937


*Compute DID estimate
scalar DID= (10.67722- (-2.371496)) - (-1.129088- (-2.828423))

. di DID
11.349381
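The reason the interaction coefficient reproduces this number can be sketched from the four group means. This is a derivation under the assumption that the model is $Y = \beta_1 + \beta_2 T + \beta_3 P + \beta_4 (T \cdot P) + \varepsilon$ with $E[\varepsilon \mid T, P] = 0$, where $Y$, $T$ and $P$ are shorthand introduced here for ROA, treatment and post.

```latex
\begin{align*}
E[Y \mid T=0, P=0] &= \beta_1 \\
E[Y \mid T=0, P=1] &= \beta_1 + \beta_3 \\
E[Y \mid T=1, P=0] &= \beta_1 + \beta_2 \\
E[Y \mid T=1, P=1] &= \beta_1 + \beta_2 + \beta_3 + \beta_4 \\[4pt]
\text{DID} &= \bigl(E[Y \mid T=1, P=1] - E[Y \mid T=1, P=0]\bigr)
            - \bigl(E[Y \mid T=0, P=1] - E[Y \mid T=0, P=0]\bigr) \\
           &= (\beta_3 + \beta_4) - \beta_3 = \beta_4
\end{align*}
```

The regression output above fits this pattern: _cons (-2.828423) is the mean ROA of the untreated firms in the pre-period, _cons plus the post coefficient gives -1.129088, and _cons plus the treatment coefficient gives -2.371496, matching the summarize results.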

Last edited by Andrew Musau; 10 May 2015, 22:41.

Susanne Daae

Join Date: Mar 2015

Posts: 22
#13

11 May 2015, 04:31

Thank you so so much, it is completely clear to me now! It was the code

reg ROA treatment post interaction

I didn't quite understand because I have never done a regression in Stata before, so I thought there was more hocus-pocus to the code.

I have now winsorized my ROAs (thanks to another's great question and answers I found on Statalist) at the 5th percentile, made the additional dummies you described and then ran the DID regression. Unfortunately my interaction is not statistically significant at the 5% level when I do a DID regression. It is significant when I only do a simple regression on

Reg interaction

But not when I adjust for post and treatment. I guess I need to include firm-specific effects and regions in my overall analysis. I will now try the same on the regional documents, and hopefully it will give statistically significant results.

But thank you all so much for your help! I have asked one question here previously and got excellent answers then as well, you guys are great!!

Best regards,
Susanne Daae

Susanne Daae

Join Date: Mar 2015

Posts: 22
#14

11 May 2015, 04:33

Sorry I meant

reg ROA interaction

when I talked about the simple regression that was statistically significant.

Jeff Wooldridge

Join Date: Apr 2014

Posts: 2081
#15

11 May 2015, 05:27

I think there's some confusion about what DID means in general. You cannot, with multiple years and treatments taking effect in different periods for different units, write it as a simple difference of means. That's why insisting on the interpretation as a coefficient on the simple interaction is throwing you off. Controlling for enough stuff, you want to know the counterfactual: what is the difference in the treated and nontreated state? I thought you might be treating each time period as a different treatment. There is a way to think about this, but the easiest thing is to do what you have done: define a single treatment indicator. This happens all the time in panel data policy analysis. Don't get so hung up on DID! If I want to test the effect of a new traffic law in the United States, I define an indicator equal to one if state i in year t was subject to the law. Period. That this can be written as an interaction of two dummies is not important.

But your simple regression above is not nearly convincing. As I said, you need to include unit fixed effects -- which, in the two period case, is the same as putting in the "treatment" dummy on its own but is much better in general. You need to have a full set of year dummies.

When T = 2 and there is a pre-treatment period for all units, the simple DID is the same as fixed effects estimation with a time dummy and the so-called interaction (the treatment dummy). Last time I will write it:

xtset id year
xtreg ROA i.year interaction, fe cluster(id)

I would just call "interaction" something like "treat."
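The two-period equivalence can be checked on Andrew's 20 observations from #12. This is a sketch, not a template for the full analysis: the route column is omitted, `group` and `treat` are names chosen here for the treatment-group dummy and the "interaction", and `vce(cluster id)` is used as the documented spelling of the clustering option.

```stata
* Data from the illustration in #12 (route column omitted)
clear
input ROA id year
1.775622   1  1998
3.83331    2  1998
-5.210526  3  1998
1.478725   4  1998
-13.73461  5  1998
-8.754751  6  1998
-2.822808  7  1998
-.3456052  8  1998
4.453937   9  1998
-6.672886  10 1998
9.433187   1  2000
8.438064   2  2000
9.211845   3  2000
8.478987   4  2000
17.824034  5  2000
1.470532   6  2000
-8.817384  7  2000
2.556952   8  2000
-4.566959  9  2000
3.711421   10 2000
end

gen group = id < 6                    // treatment-group dummy
gen post  = year == 2000              // post-period dummy
gen treat = group & year == 2000      // the "interaction" of #12

* Pooled OLS DID with separate group and post dummies
regress ROA group post treat
scalar b_ols = _b[treat]

* Firm fixed effects plus year dummies: the group dummy is absorbed
xtset id year
xtreg ROA i.year treat, fe vce(cluster id)
scalar b_fe = _b[treat]

display "OLS DID = " b_ols "   FE = " b_fe
```

Both coefficients should match the 11.34938 reported in #12; only the standard errors differ, because the second regression clusters by firm. With more than two years or staggered start dates the pooled version no longer applies directly, while the fixed effects command carries over unchanged.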