Data generation that satisfies different coefficients for interaction term in diff-in-diff regression with and without time dummies?

Gokhan Dilek

Join Date: Oct 2024

Posts: 1
#1


18 Oct 2024, 08:50

Hello all,

I have tried many times to construct a panel dataset in which adding time dummies to the regression changes the coefficient on the interaction term. To elaborate, imagine a panel dataset with the following variables: outcome, year, unit_id (district in the code below), a treated dummy, a post dummy, and an interaction (treated × post) dummy.
In every dataset I have built so far, running this regression:
Outcome = α + β interaction + λ treated + γ post + e
and this regression:
Outcome = α + β interaction + λ treated + γ post + i.year + e
and this one:
Outcome = α + β interaction + λ treated + γ post + i.year + i.district + e
yields the same β coefficient.
I tried several data-generating scripts in Stata, including one that defines a technology variable that changes over time and affects the outcome differently for control and treated units. The most recent one I used is pasted here as an example:
“
set seed 12345
local n_districts = 50
local n_years = 15
set obs `= `n_districts' * `n_years''
gen district = .
gen year = .

* Fill in district and year for each observation
forval d = 1/`n_districts' {
    forval y = 2006/2020 {
        replace district = `d' in `= (`d' - 1) * `n_years' + (`y' - 2005)'
        replace year = `y' in `= (`d' - 1) * `n_years' + (`y' - 2005)'
    }
}

gen treated = 0
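* Assign 20 out of 50 districts to be treated
* Note: sorting on randnum only shuffles the row order; treatment itself is
* determined by the district number, so districts 1-20 are always treated
gen randnum = runiform()
sort randnum
replace treated = 1 if district <= 20 /* Districts 1-20 are treated */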

gen post = (year >= 2015)
gen outcome = .

* Pre-treatment period (2006-2014): Parallel increasing trends, with treated having higher baseline
gen base_control = 50 + runiform(0, 5) // Baseline outcome level for control
gen base_treated = 65 + runiform(0, 5) // Higher baseline for treated units

gen trend_control = 2 + runiform(0, 0.5) // Trend for control units
gen trend_treated = 2 + runiform(0, 0.5) // Parallel trend for treated units

* Add random noise
gen noise = rnormal(0, 10) // Increased noise with a standard deviation of 10

replace outcome = base_control + trend_control*(year - 2005) + noise if treated == 0 & year < 2015
replace outcome = base_treated + trend_treated*(year - 2005) + noise if treated == 1 & year < 2015

bysort district: egen outcome_2014_control = mean(cond(year == 2014 & treated == 0, outcome, .))
replace outcome = outcome_2014_control + rnormal(0, 5) if treated == 0 & year >= 2015
drop outcome_2014_control

replace trend_treated = 4 + runiform(0, 0.5) if treated == 1 & year >= 2015 // Treated accelerates further
replace outcome = base_treated + trend_treated*(year - 2005) + noise if treated == 1 & year >= 2015

* Create a technology variable that increases linearly over time
gen technology = 1 + 0.2 * (year - 2006) // Rises from 1 in 2006 to 3.8 in 2020

* Adjust the impact of technology on the outcome
* For control units, technology has a smaller effect (multiplied by 10)
replace outcome = outcome + 10 * technology if treated == 0

* For treated units, technology has 3 times the effect (multiplied by 30)
replace outcome = outcome + 30 * technology if treated == 1

* Drop intermediate variables
drop base_control base_treated trend_control trend_treated randnum noise

sort district year

xtset district year

gen interaction=treated*post
“
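As a rough sketch of the three specifications described above, the snippet below runs them with regress and stores the coefficient on the interaction term each time. To stay self-contained it builds a small balanced stand-in panel instead of reusing the full simulation; the effect size of 10, the trend of 2 per year, and the scalar names b1, b2 and b3 are illustrative assumptions, not part of the original code.

```stata
* Stand-in balanced panel: 50 districts, years 2006-2020 (assumed values)
clear
set seed 12345
set obs 50
gen district = _n
gen treated = district <= 20
expand 15
bysort district: gen year = 2005 + _n
gen post = year >= 2015
gen interaction = treated * post
gen outcome = 50 + 15*treated + 2*(year - 2005) + 10*interaction + rnormal(0, 10)

* Specification 1: treated, post and their interaction only
regress outcome interaction treated post
scalar b1 = _b[interaction]

* Specification 2: add year dummies
regress outcome interaction treated post i.year
scalar b2 = _b[interaction]

* Specification 3: add year and district dummies
regress outcome interaction treated post i.year i.district
scalar b3 = _b[interaction]

display "b1 = " b1 "   b2 = " b2 "   b3 = " b3
```

On a balanced panel with a single treatment date, the three stored values should agree up to numerical precision, and Stata will report some terms as omitted because of collinearity.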
My question is: what kind of data-generating process is needed to get different coefficients on the interaction term when running the diff-in-diff regression with and without time dummies?
George Ford

Join Date: Aug 2014

Posts: 3207
#2

21 Oct 2024, 15:06

treated is collinear with i.district and post is collinear with i.year. You are capturing i.year with post. In fact, treated and post are estimated only because of the way you've ordered the variables (the coefficients are just one of the i.district or i.year values). Use reghdfe instead and you'll see they are not estimated.
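One way to check this collinearity, sketched on a small stand-in panel (the data-generating values and scalar names are assumptions for illustration): regress post on the year dummies and treated on the district dummies. An R-squared of 1 in each case means the dummy is an exact linear combination of the fixed effects.

```stata
* Stand-in balanced panel: 50 districts, years 2006-2020 (assumed values)
clear
set seed 12345
set obs 50
gen district = _n
gen treated = district <= 20
expand 15
bysort district: gen year = 2005 + _n
gen post = year >= 2015

* post is fully explained by the year dummies
regress post i.year
scalar r2_post = e(r2)

* treated is fully explained by the district dummies
regress treated i.district
scalar r2_treated = e(r2)

display "R2 (post on i.year) = " r2_post "   R2 (treated on i.district) = " r2_treated

* If the community-contributed reghdfe is installed (ssc install reghdfe),
* the same point should show up as omitted coefficients, e.g.:
* reghdfe outcome interaction treated post, absorb(district year)
```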

If you want the DID coefficient to be different with i.year included, leave post out of the first regression so that you do not account for time.
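A minimal sketch of this suggestion, again on an assumed stand-in panel with a common upward trend (the trend of 2 per year, the effect of 10, and the scalar names are illustrative choices): with post left out and no year dummies, the interaction term also absorbs the change over time, so its coefficient moves once i.year is added.

```stata
* Stand-in balanced panel: 50 districts, years 2006-2020 (assumed values)
clear
set seed 12345
set obs 50
gen district = _n
gen treated = district <= 20
expand 15
bysort district: gen year = 2005 + _n
gen post = year >= 2015
gen interaction = treated * post
gen outcome = 50 + 15*treated + 2*(year - 2005) + 10*interaction + rnormal(0, 10)

* No post and no year dummies: time is not accounted for
regress outcome interaction treated
scalar b_nopost = _b[interaction]

* Year dummies added: time is now accounted for
regress outcome interaction treated i.year
scalar b_year = _b[interaction]

display "without time controls: " b_nopost "   with i.year: " b_year
```

In the first regression the interaction coefficient is the post-minus-pre change for treated units alone, so it includes the common trend; the year dummies remove that trend in the second.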