control group selection for DiD

Vincent Rowold

#1

07 Oct 2021, 08:09

Hello!
I would like to estimate the effect of a flood event on land prices. I have decided to use a difference-in-differences design, with a treatment group containing observed land transactions before and after the event (repeated cross-section) inside the floodplain, and a control group outside the floodplain. In addition to the price per square meter, the data also include various details about each land unit (distance to the city center, distance to the river, square meters...). The pre-treatment price trend is quite different for the two groups, so I want to restrict the control group to observations for which a parallel trend is more plausible. Since the pool of possible control observations is large (45,000 observations), I want to select only those that closely resemble my counterfactual.
I have read a lot about matching techniques, but I am struggling to implement them properly.
My approach was the following:

1. I ran propensity score matching (PSM) with several algorithms (nearest neighbor, caliper...) and chose the one that balanced the covariates best.
2. I created a dataset with only the matched data and estimated the DiD.

My questions are:
1. Is it OK to use covariates that correlate with both the dependent variable and treatment status?
2. Can I include the covariates in the DiD regression with the matched data?
3. What problems may arise with cross-sectional data? Should I match the pre- and post-treatment observations separately?
4. Are there other (better) techniques for control group selection?
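
A minimal sketch of the two-step approach described in this post, not a recommended specification. It assumes the community-contributed psmatch2 package (ssc install psmatch2, which also provides pstest) and hypothetical variable names: floodplain (1 = inside the floodplain), post (1 = after the flood), price_sqm, dist_center, dist_river and sqm. The caliper width is an arbitrary placeholder.

Code:

[CODE]
* Sketch only; all variable names are hypothetical
* Requires: ssc install psmatch2

* Step 1: 1-nearest-neighbor matching on a logit propensity score, within a caliper
psmatch2 floodplain dist_center dist_river sqm, logit neighbor(1) caliper(0.05) common

* Covariate balance before and after matching
pstest dist_center dist_river sqm, both

* Step 2: DiD on the matched sample, weighting by the matching weight;
* observations with a missing _weight drop out of the regression
regress price_sqm i.floodplain##i.post [pweight = _weight], vce(robust)
[/CODE]

Repeating step 1 with other psmatch2 options (for example a different caliper or number of neighbors) and comparing the pstest output is one way to choose among matching algorithms by balance.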
Clyde Schechter

#2

07 Oct 2021, 10:11

1. Yes. In fact, that is the whole point of matching--to reduce the effect of confounding variables. What you need to be careful about, however, is that you should not match on covariates that are on the causal path between the treatment status and the dependent variable.

2. That depends on which kind of matching you are using. With caliper matching, you can include them. However, the effects of any variables that were matched exactly become inestimable, so you cannot include those. (More precisely, if you are doing a proper matched-pair analysis of the data and you try to include them, Stata will recognize that their effects are inestimable and will omit them.)

3. You definitely should not match the pre- and post- treatment observations separately. You should identify a control group that is reasonably well matched to the treatment group based exclusively on pre-treatment variables. Then you analyze with that treatment and control group. The treatment and control groups should be the same units of analysis in both the pre- and post-treatment period (except that some units of analysis observed pre-treatment might be unavailable for follow-up in the post-treatment period.)

4. Alternative approaches include not matching at all and just adjusting for covariates. You are in the fortunate position of having a large pool of potential controls to draw on, but in many situations it is only possible to match on a small number of covariates even though many important covariates exist. In that situation, one might forgo matching altogether, or match only on a few of the most important confounding variables and then adjust for the others by including them in the regression model. But in your situation, if you are able to match an adequate number of your cases, I think matching is the better approach.

Be sure you use an appropriate matched-pair analysis. A regression that does not appropriately reflect the matching will produce incorrect results.
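
A minimal sketch of the covariate-adjustment alternative from point 4, with no matching. The variable names (price_sqm, floodplain, post, dist_center, dist_river, sqm) are hypothetical, and the specification is only an illustration, not the matched-pair analysis recommended above.

Code:

[CODE]
* Sketch only; all variable names are hypothetical
* DiD with the covariates entered directly as regressors
regress price_sqm i.floodplain##i.post dist_center dist_river sqm, vce(robust)

* The DiD estimate is the coefficient on the floodplain-by-post interaction
lincom 1.floodplain#1.post
[/CODE]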
Guest
#3

14 Jul 2023, 09:35

I have a similar problem, and I am not very experienced with Stata.
I am comparing sustainability scores of companies from two countries. There was an exogenous event in 2017 that affected one country but not the other, so I want to estimate the impact of that event on the sustainability scores. I am using DiD for that, but I have a problem defining the control group.

Code:

[CODE]
gen id = _n
order id
reshape long x, i(id) j(year)
xtset id
encode country, generate(country_n)
encode companyname, generate(company_n)
egen panel_id = group(year company_n)
xtset panel_id
g lcompanysize = ln(companysize)
g post = year>=2017
g treated = country == "treatedcountry"
g posttreated = post * treated
xtdidregress (esgscore) (post), group(treated) time(year)
[/CODE]

The error I get: "invalid group specification" and "None of the groups defined by treated is a control." I suspect the problem lies in how I define the treatment and control groups, or in the xtdidregress call.

Last edited by Rina Vel; 14 Jul 2023, 09:38.
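
The error message in this post is consistent with using post as the treatment variable: post equals 1 for both countries from 2017 onward, so neither group stays untreated. A minimal sketch of a possible fix follows. It assumes that id identifies one company (one row of the original wide data) and that the other variables exist as created in the code in the post. With only two countries, the group-clustered standard errors rest on two clusters and should be read with caution.

Code:

[CODE]
* Sketch only: replaces the two xtset calls and the final xtdidregress line of the post
* Assumes id identifies one company, and that posttreated, lcompanysize and
* country_n exist as created in the poster's code

* Declare the panel as company by year, not one panel per company-year
xtset id year

* The treatment variable must be 1 only for treated units in treated periods,
* so use posttreated rather than post; lcompanysize is an optional covariate
xtdidregress (esgscore lcompanysize) (posttreated), group(country_n) time(year)
[/CODE]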
Jared Greathouse

#4

14 Jul 2023, 19:23

I have a paper on this that I'm gonna submit to JPAM. In my opinion, you can only (at this stage of Stata) really argue for the similarity of your control units on conceptual grounds and realized covariate similarities. However, this isn't ideal since it's all very very VERY arbitrary. If we have one treated unit and 80 control units, where say 30 of them are located along the same coastline, we could restrict our donor pool to those 30, but this is super arbitrary. What if your treated unit is unique among those 30 units (let's say it's a particularly popular city for tourism), such that some of the 30 may be good matches, but some of the other 50 might be viable candidates too?

How do you go about selecting the RIGHT donors then? And how do you know which ones to go with? Well, I would advocate for using something like functional PCA and k-means clustering. Can't do it in Stata! You'd need the Python code and run that through Stata. I know I sorta rambled there, but to me that's the story, imperfect as it is.
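
As a rough illustration of the clustering idea only: the sketch below swaps in ordinary PCA (Stata's pca) for functional PCA, which is a cruder stand-in and not the method described in this post. It assumes hypothetical wide data with one row per unit, pre-period outcomes y2010 through y2016 stored next to each other, and treated = 1 for a single treated unit. The number of components, the number of clusters and the seed are arbitrary.

Code:

[CODE]
* Sketch only; ordinary PCA stands in for functional PCA
* Hypothetical data: one row per unit, pre-period outcomes y2010-y2016, treated = 0/1

* Summarize each unit's pre-treatment trajectory by two principal component scores
pca y2010-y2016, components(2)
predict pc1 pc2, score

* Group units with similar trajectories
cluster kmeans pc1 pc2, k(3) name(traj) generate(grp) start(krandom(12345))

* Keep the (single) treated unit and the controls that share its cluster
summarize grp if treated == 1, meanonly
keep if grp == r(min)
[/CODE]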