Wonder how I should set my control and my treatment groups for a diff-in-diff regression?

Krishna Mantirraju

Join Date: Apr 2019

Posts: 6
#1


16 Apr 2019, 16:31

Hello!

I'm looking to run my first diff-in-diff regression for my research project on the effects of supply deregulation on oil prices. My control group is obviously the states that didn't undergo supply deregulation, and my treatment group is the group of states that did. I will then analyze the effects on gas prices sold to consumers to determine whether supply deregulation increased or decreased prices. I've already checked that all of the DD assumptions hold, i.e. parallel trends, etc.

However, there are a couple of issues that I can't seem to get past when setting up a DD regression:

Different states deregulated at different times, so I should definitely run different regressions for those states at those different years. However, should I pair up the US states on a one-to-one basis at those respective years (i.e. pair up Michigan, which deregulated in 2008, with Nevada, which hasn't undergone any deregulation, and analyze prices since then), run each individual treated state against the mean price of the aggregate control group, or is there something else I should do?

I understand the basis of a DD regression, but I'm just confused as to the scope of my control and treatment groups. Any and all advice would be sincerely appreciated.

Clyde Schechter

Join Date: Apr 2014

Posts: 29950
#2

16 Apr 2019, 19:50

A matched pairs approach is one possibility, though it is tricky to carry out. Simpler is to use generalized diff-in-diff estimation instead.

You don't show any example data, so I'll make up the names of some variables that I assume exist in your data (if they don't you should create them). You need a variable tracking the years. Let's just call it year. You need another variable, which I'll call deregulated, which is set to 1 in those observations where the state had deregulated by that year, 0 in all other observations. Otherwise put, this variable, deregulated, is actually like a treatment#pre_post interaction term. It is 1 in those observations where the treatment is in effect, and 0 in all others. You need a variable indicating the state, and I'll call that state.

Code:

* state must be numeric for -xtset- (use -encode- first if it is a string)
xtset state year
xtreg oil_price i.deregulated i.year, fe

That's the bare bones approach. You may want to use cluster robust standard errors, and there may be covariates you want to add. But that's the core of the generalized DID approach. The coefficient of variable deregulated is the generalized diff-in-diff estimator of the effect of deregulation on oil prices. There is no variable designating the treatment vs control group status, nor any pre-post variable. The information that would be carried by those in a classical DID analysis is carried instead by the year and state fixed effects.
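A minimal sketch of how the deregulated indicator might be built and the model run with cluster-robust standard errors. It assumes a hypothetical variable dereg_year holding each state's deregulation year (missing for states that never deregulated), and it uses made-up toy data so that it runs on its own; none of these names or numbers come from the actual data set.

```stata
* Toy panel: 6 hypothetical states observed 2005-2012 (all values made up)
clear
set seed 12345
set obs 6
gen state = _n

* dereg_year is a hypothetical variable: the year a state deregulated,
* missing for states that never deregulated
gen dereg_year = .
replace dereg_year = 2008 if state == 1
replace dereg_year = 2010 if state == 2
replace dereg_year = 2007 if state == 3

* 8 yearly observations per state
expand 8
bysort state: gen year = 2004 + _n

* 1 in state-years where deregulation is in effect, 0 otherwise.
* The !missing() guard is redundant here (year >= . is false in Stata),
* but it makes the intent explicit.
gen byte deregulated = !missing(dereg_year) & year >= dereg_year

* Simulated outcome with a built-in deregulation effect of -0.30
gen oil_price = 2 + 0.1*state + 0.05*(year - 2005) ///
    - 0.30*deregulated + rnormal(0, 0.05)

* Generalized DID: state fixed effects via -fe-, year effects via i.year,
* standard errors clustered on state
xtset state year
xtreg oil_price i.deregulated i.year, fe vce(cluster state)
```

With real data, only the last three commands would be needed once deregulated exists. A handful of clusters, as in this toy example, is too few for cluster-robust standard errors to be trustworthy; it is used here only to keep the sketch short.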

See https://www.annualreviews.org/doi/pd...-040617-013507 for more information.
Krishna Mantirraju

Join Date: Apr 2019

Posts: 6
#3

16 Apr 2019, 22:41

Thank you so much for your answer Clyde!

So just to be clear, what you're suggesting is combining all of my data (I have one spreadsheet with year and price for each of my states) into a single spreadsheet, then setting up some dummy variables (deregulated, year) to specify whether a state is in a post-deregulation year. That will then be the basis of my generalized DiD regression.

If I was to go about using a matched-pairs approach, which could be a bit easier as far as coding goes because different states deregulated at different times, I could pick my pairs based on the closeness of the oil prices before a state's deregulation year. I could then run a regression for each pair, setting up two dummy variables as before.

What would you suggest in that regard?

I'd also love to put some example data up in a bit, but I'm in the process of organizing what I have so far. I'll probably post an example tomorrow morning.
Joseph Coveney

Join Date: Apr 2014

Posts: 4373
#4

17 Apr 2019, 05:21

Originally posted by Clyde Schechter

See https://www.annualreviews.org/doi/pd...-040617-013507 for more information.

Clyde, that's a great introductory review. Thank you for the pointer.
Clyde Schechter

Join Date: Apr 2014

Posts: 29950
#5

17 Apr 2019, 08:59

Re #3. First, I don't think you should do anything in a spreadsheet. For serious work, you need an audit trail of every step. You should import your data as they came to you from spreadsheets into Stata and then everything from there should be done in Stata, with the code kept in do-files and the output in log or smcl files. Even the importation of the data from spreadsheets into Stata should be done using the -import excel- command or using Stat/Transfer or a similar tool that leaves an audit trail. Don't copy/paste from Excel into Stata's data editor: there's no audit trail of what happened, and it's easy to mistakenly omit some rows or columns.
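As a rough illustration of such an audit trail, here is a do-file sketch. The file names are hypothetical placeholders, and the toy workbook is written out first only so that the example runs on its own; with real data the -export excel- step would not exist.

```stata
* Do-file sketch; all file names below are hypothetical placeholders
clear all
capture log close
log using "import_prices.log", replace text

* Stand-in for a spreadsheet as received: write a tiny toy workbook
* (values are made up) so that this example is self-contained
set obs 3
gen year = 2005 + _n
gen price = 2.50 + 0.10*_n
export excel using "toy_state.xlsx", firstrow(variables) replace

* The actual import step, which is now recorded in the log
import excel using "toy_state.xlsx", firstrow clear
describe

* Save a Stata copy; all later work starts from this file
save "toy_state.dta", replace

log close
```

Running the do-file again reproduces the same .dta file and a log of exactly what was done.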

You may be the only person I've ever "met" who thinks a matched-pairs approach is simpler! But I suppose simplicity is in the eye of the beholder. That kind of matching might be done using the -calipmatch- program, available from SSC. Or you can hand-code the matching if it's going to be more complicated than what -calipmatch- can do for you. (There are many examples of matching code, from the very simple to the very complicated, on the Forum.) The next key step to prepare for diff-in-diff analysis is to attribute the start date of the treated member of the pair to the otherwise undefined start date of the untreated member, and then define your pre-post variable with respect to that date.
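The start-date step might look something like the sketch below. It assumes the matching has already produced a hypothetical pair_id variable and that dereg_year (also a made-up name) is missing for untreated states; the data are toy values.

```stata
* Toy data: 2 hypothetical matched pairs, each with one treated and one
* untreated state, observed 2005-2010 (all values made up)
clear
set obs 4
gen state = _n
gen pair_id = ceil(_n/2)
* States 1 and 3 are the treated members of their pairs
gen byte treatment = mod(_n, 2)
gen dereg_year = .
replace dereg_year = 2008 if state == 1
replace dereg_year = 2007 if state == 3

* 6 yearly observations per state
expand 6
bysort state: gen year = 2004 + _n

* Give both members of a pair the treated member's start year
* (egen's max() ignores the untreated member's missing value)
bysort pair_id: egen start_year = max(dereg_year)

* Pre-post indicator defined relative to the pair's start year
gen byte pre_post = year >= start_year
```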

There are other problems with this approach. In economics and econometrics, there is a fairly strong preference for fixed-effects estimation in most situations. But when you have matched pair data, the data is inherently hierarchical with 3 levels: replications over time within states (or whatever the unit of analysis in your study is), and those now within matched pairs. There are no three-level fixed-effects estimators. So you have to then choose which way you wish to mangle the structure of your data for analysis, or risk disapproval for using random-effects estimation. Of course, my own approach is to use the multi-level random effects modeling and incorporate as many covariates as I reasonably can to reduce confounding bias as much as possible.

Probably the simplest way to get a fixed effects estimation is to flatten the matched pair structure to one observation per pair at each time point, and use the ratio or difference, as appropriate to the variable involved, of the pair's outcomes as the dependent variable in the model. The drawbacks to this approach are that there may be other important covariates where the matched pairs disagree and it isn't always clear how to handle those variables, and that using the ratio or difference of the pair's outcomes as your dependent variable imposes a very stringent constraint on the relationship of those outcomes within the matched pairs. Assuming you use this approach, the key variable now becomes simply the pre-post variable, defined with respect to the start year for the treated member of the pair. The treatment effect is estimated as the coefficient of the pre-post variable in this model.
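A sketch of the flattening step, using the within-pair difference as the outcome. The variable names (pair_id, treatment, pre_post, price) and the toy data are assumptions for illustration, with pre_post already defined relative to each pair's start year.

```stata
* Toy data: 2 hypothetical matched pairs observed 2005-2010 (values made up)
clear
set seed 2019
set obs 4
gen state = _n
gen pair_id = ceil(_n/2)
gen byte treatment = mod(_n, 2)
expand 6
bysort state: gen year = 2004 + _n
gen start_year = cond(pair_id == 1, 2008, 2007)
gen byte pre_post = year >= start_year
gen price = 2 + 0.2*pair_id - 0.3*treatment*pre_post + rnormal(0, 0.05)

* Flatten to one observation per pair-year:
* price1 = treated member's price, price0 = untreated member's price
keep pair_id year pre_post treatment price
reshape wide price, i(pair_id year) j(treatment)
gen price_diff = price1 - price0

* Pair fixed effects; the coefficient on 1.pre_post estimates the effect
xtset pair_id year
xtreg price_diff i.pre_post, fe
```

Using the ratio instead would only change the -gen price_diff- line; which of the two is appropriate depends on the outcome.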

Another approach is to do the usual DID estimation, with random effects for the matched pairs, but not for the states. Instead, you can take the pre-post, and treatment#pre-post interaction variables and split them up into within-state and between-state variables, like this:

Code:

gen interaction = 1.treatment#1.pre_post
foreach v of varlist pre_post interaction {
    by state, sort: egen `v'_mean = mean(`v')
    gen `v'_within = `v' - `v'_mean
}

and then use -treatment pre_post_mean pre_post_within interaction_mean interaction_within- as your key predictors in the analysis. The coefficients for the *_within variables are fixed-effects-like estimators for the DID model. (You do not distinguish mean and within versions of treatment because it is a purely between-states variable and treatment_within would be a constant 0 if you tried to calculate it.)
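One way this might be put together is sketched below, using -mixed- with a random intercept for the matched pair. The toy data and the names pair_id and price are assumptions, and the interaction is written here as a simple product, which gives the same values when both variables are coded 0/1 with no missing values.

```stata
* Toy data: 10 hypothetical matched pairs (20 states), 2005-2010, values made up
clear
set seed 2019
set obs 20
gen state = _n
gen pair_id = ceil(_n/2)
gen byte treatment = mod(_n, 2)

* Pair-level random effect shared by both members of a pair
gen u = rnormal(0, 0.2)
bysort pair_id (state): replace u = u[1]

expand 6
bysort state: gen year = 2004 + _n
gen start_year = 2006 + mod(pair_id, 3)
gen byte pre_post = year >= start_year
gen price = 2 + u - 0.3*treatment*pre_post + rnormal(0, 0.05)

* Split pre_post and the interaction into between- and within-state parts
gen interaction = treatment*pre_post
foreach v of varlist pre_post interaction {
    by state, sort: egen `v'_mean = mean(`v')
    gen `v'_within = `v' - `v'_mean
}

* Random intercept for the matched pair, none for the state
mixed price i.treatment pre_post_mean pre_post_within ///
    interaction_mean interaction_within || pair_id:
```

In this setup the coefficient on interaction_within is the fixed-effects-like DID estimate.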

So those are some ways you can proceed if you want to do matched pairs. Do you still think matched pairs is simpler than generalized DID?
Krishna Mantirraju

Join Date: Apr 2019

Posts: 6
#6

21 Apr 2019, 17:15

Sorry about the delay, I got a bit sick and forgot to upload an example of some data as I promised.
Here are some examples: North Carolina.xlsx Colorado.xlsx

Like you recommended, I'll go ahead and leave the audit trail just to be safe and do all of my work in Stata.

What I was thinking about doing was matching states based on the average consumer price of gasoline the year before each state decided to regulate, as the consumer price takes into account costs of transportation, region, and a lot of other covariates I might miss. I was also thinking about using the difference between the two prices as the dependent variable in the fixed effects estimation, or would that not work?

I'll also go ahead and check out a generalized DID as well, but I wanted to hear your thoughts on my idea of a matched pairs regression.
Clyde Schechter

Join Date: Apr 2014

Posts: 29950
#7

21 Apr 2019, 17:27

As for whether matching on the price of gasoline the year before regulation is a good way to match, I really can't say. Your explanation sounds good to me, but I am not an economist and don't know whether what you say is correct, whether it is an adequate basis for matching, or whether it even might constitute over-matching for your purposes. Those judgments depend on a knowledge of economics, and of this particular area of economics, that I simply don't have.

Again, from a purely statistical perspective, a generalized DID analysis looks more attractive to me, and it also seems, to me, much simpler to implement. But if the matching you propose is a really good one, then that might well outweigh these purely statistical and ease-of-implementation concerns.

There are a number of economists who participate in this Forum regularly, and perhaps one of them will comment.
Krishna Mantirraju

Join Date: Apr 2019

Posts: 6
#8

21 Apr 2019, 18:16

Fair enough. I'll be sure to ask around both here and at my university for some economic perspectives on this project.

Thank you so much for your suggestions and inputs Clyde, you're a real lifesaver.
