Running csdid with repeated cross-section data - Issue of control variables

Sarita Vaswani

Join Date: Sep 2023

Posts: 16
#1


27 Dec 2024, 20:59

Hi everyone,

I am looking at the impact of a policy on the income of individuals. The policy was implemented at the state level in a staggered way, and my data are repeated cross-sectional survey data, so every year I observe different individuals in each state. I would like to run csdid, but I am having trouble understanding how to deal with time-constant control variables. Controls like educational level are really important, but the help file says - "be careful of controlling for characteristics that are either time constant (e.g., sex or race), or for pretreatment characteristics." So I am not sure whether I can add these controls or not.
FernandoRios

Join Date: Apr 2014

Posts: 2469
#2

28 Dec 2024, 03:08

You can add them, but you need to be aware of the assumptions required and the implications if those fail.
In repeated cross-sections (RC), every variable is effectively time-varying. So the next assumption is for them to be stationary.
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2162
#3

28 Dec 2024, 09:19

Let me just add a couple of comments. I think it's useful to distinguish between variables that are dated prior to the intervention and those that may actually change, for an individual, due to the intervention. In a panel data set, these are relatively easy to separate. If I'm looking at adults well into their work history, and a variable is highest grade completed by age 25, then I can assume this is not affected by any intervention. But if I'm studying a job training program, I don't necessarily want to control for current marital status. In repeated cross sections, the issue is often compositional effects. For example, if a state expands Medicaid, it might get an influx of less healthy people from other areas. Then, controlling for pre-intervention health can create issues. But, it can also help address compositional effects. That's why it's tricky.

Fernando is correct that, mechanically, all covariates in an RC are effectively time-varying. But it is useful to break them into truly time-varying characteristics and those that may change for an individual due to the intervention. We usually want to avoid the latter, whether it is panel data or RC. As Fernando implies, decisions on variables dated pre-intervention are generally tricky. If the population is stable -- stationary -- then it becomes easier to justify.
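One way to probe the stationarity point is to treat a pre-determined covariate as if it were the outcome. The following is only a rough sketch under stated assumptions: the variable names (educ, age, female, state, year, gvar) are hypothetical and not taken from the original poster's data, and gvar is assumed to hold the first treatment year, with 0 for never-treated states.

```stata
* Sketch only: educ, age, female, state, year, gvar are hypothetical names.
* gvar = first treatment year of the state, 0 if never treated (assumed coding).

* 1. Look at covariate means, spread, and sample size by survey year.
tabstat educ age female, by(year) statistics(mean sd n)

* 2. Placebo DiD: use a pre-determined covariate as the outcome.
*    No ivar() is given, so csdid treats the data as repeated cross-sections.
csdid educ, time(year) gvar(gvar) method(reg) cluster(state) notyet

* 3. Event-study aggregation of the placebo estimates.
estat event
```

If the post-treatment "effects" on a covariate such as education are far from zero, that would suggest the composition of the sampled population shifted around the time of treatment, which is the situation where controlling for such variables becomes harder to justify.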
Sarita Vaswani

Join Date: Sep 2023

Posts: 16
#4

28 Dec 2024, 12:18

Thank you, Fernando and Professor Wooldridge. This is really helpful.
Daisy Dang

Join Date: Dec 2023

Posts: 23
#5

24 Feb 2025, 01:27

Dear Prof. @Jeff Wooldridge and @FernandoRios,

I have a question related to this topic, specifically about the differences between the reghdfe, csdid, and jwdid packages when using repeated cross-section data and adding control variables.
I understand that they differ in many respects, but the results differ too much in my case. (I have seen many papers that report estimates from all three packages, and those estimates are quite comparable.)

I study the effect of a county-level support program on two individual-level outcomes, using repeated cross-section data on individuals above 20 years old.

Code:

reghdfe job sup_prog i.year, absorb(county) cluster(county)
csdid job, time(year) gvar(gvar) method(reg) agg(event) cluster(county) notyet long2
jwdid job, ivar(county) tvar(year) gvar(gvar)

job is a dummy equal to 1 if the individual has a job.
The jwdid estimate is higher than the reghdfe estimate, and both are insignificant. The csdid estimate, however, is nearly double either of them and is significant at the 1% level.

Code:

reghdfe job_hi sup_prog i.year control_var*, absorb(county) cluster(county)
csdid job_hi control_var*, time(year) gvar(gvar) method(reg) agg(event) cluster(county) notyet long2
jwdid job_hi control_var*, ivar(county) tvar(year) gvar(gvar)

job_hi is a dummy equal to 1 if the individual has a high-paid job.
control_var* includes age, age squared, education, and sex. (These controls are unlikely to be affected by the treatment, since the sample includes only individuals over 20 years old.)

For this outcome, without controls, the jwdid and csdid estimates are comparable (each nearly double the reghdfe estimate).
With controls, the csdid estimate is nearly double the jwdid estimate and four times the reghdfe estimate. Only the csdid estimate is significant at the 5% level.

I wonder:
1. What makes the estimates so different?
2. Why does adding controls to csdid change the estimates and their significance that much?
3. Which package is most suitable, especially when adding controls and using RC data?

Could you share some insights on those?
Thanks in advance.
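Part of the gap may come from comparing different summaries: agg(event) in csdid reports event-time effects, whereas the reghdfe coefficient on sup_prog is a single post-treatment number. A minimal sketch of putting the three on a more similar footing is below. It reuses the variable names from the commands above and assumes the installed versions of csdid and jwdid both provide the `estat simple` post-estimation aggregation; it is not a claim about which estimator is correct.

```stata
* Sketch only; variable names are those used in the commands above.

* TWFE: one coefficient on the treatment dummy.
reghdfe job sup_prog i.year, absorb(county) cluster(county)

* csdid: estimate the group-time ATTs, then ask for one overall ATT
* instead of the event-study aggregation.
csdid job, time(year) gvar(gvar) method(reg) cluster(county) notyet
estat simple

* jwdid: estimate, then aggregate to one overall ATT in the same way.
jwdid job, ivar(county) tvar(year) gvar(gvar)
estat simple
```

Even after this step the numbers need not agree, since the three commands can differ in comparison groups, weighting, and how covariates enter, but the remaining differences are then easier to attribute.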