Hello all,
I have tried many times to construct a panel dataset in which adding time dummies to the regression changes the coefficient on the interaction term. To elaborate, imagine a panel dataset with the following variables: outcome, year, unit_id (district in the code below), a treated dummy, a post dummy, and an interaction (treated × post) dummy.
Running this regression:
Outcome = α + β interaction + λ treated + γ post + e
and this regression:
Outcome = α + β interaction + λ treated + γ post + i.year + e
and this one:
Outcome = α + β interaction + λ treated + γ post + i.year + i.district + e
all yield the same β coefficient.
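For reference, the three specifications above could be run in Stata roughly as follows. This is only a minimal sketch: the toy data-generating step and the scalar names b_plain, b_year, and b_both are illustrative assumptions, not the data used later in this post. Stata automatically omits terms that are collinear with the year or district dummies.

```stata
clear
set seed 12345

* Toy balanced panel (illustrative assumption): 50 districts x 15 years
set obs 750
gen district = ceil(_n / 15)
gen year = 2006 + mod(_n - 1, 15)
gen treated = district <= 20
gen post = year >= 2015
gen interaction = treated * post

* Placeholder outcome with a common trend and a treatment effect of 10
gen outcome = 50 + 15*treated + 2*(year - 2005) + 10*interaction + rnormal(0, 10)

* (1) Plain 2x2 diff-in-diff
regress outcome interaction treated post
scalar b_plain = _b[interaction]

* (2) Adding year dummies
regress outcome interaction treated post i.year
scalar b_year = _b[interaction]

* (3) Adding year and district dummies
regress outcome interaction treated post i.year i.district
scalar b_both = _b[interaction]

display "plain: " b_plain "  year FE: " b_year "  year + district FE: " b_both
```

Storing `_b[interaction]` after each regression makes it easy to compare the three estimates side by side.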
I have tried several data-generating processes in Stata, including one that defines a technology variable that changes over time and affects the outcome differently for control and treated units. The most recent one I used is pasted here as an example:
“
set seed 12345
local n_districts = 50
local n_years = 15
set obs `= `n_districts' * `n_years''
* Fill in district and year for each observation (15 consecutive rows per district)
gen district = ceil(_n / `n_years')
gen year = 2006 + mod(_n - 1, `n_years')
gen treated = 0
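* Assign 20 out of 50 districts to be treated
* Note: randnum is drawn per observation, so sorting on it only shuffles the row order;
* treatment is determined by district ID alone (districts 1-20)
gen randnum = runiform()
sort randnum
replace treated = 1 if district <= 20 /* First 20 districts are treated */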
gen post = (year >= 2015)
gen outcome = .
* Pre-treatment period (2006-2014): Parallel increasing trends, with treated having higher baseline
gen base_control = 50 + runiform(0, 5) // Baseline outcome level for control
gen base_treated = 65 + runiform(0, 5) // Higher baseline for treated units
gen trend_control = 2 + runiform(0, 0.5) // Trend for control units
gen trend_treated = 2 + runiform(0, 0.5) // Parallel trend for treated units
* Add random noise
gen noise = rnormal(0, 10) // Increased noise with a standard deviation of 10
replace outcome = base_control + trend_control*(year - 2005) + noise if treated == 0 & year < 2015
replace outcome = base_treated + trend_treated*(year - 2005) + noise if treated == 1 & year < 2015
bysort district: egen outcome_2014_control = mean(cond(year == 2014 & treated == 0, outcome, .))
replace outcome = outcome_2014_control + rnormal(0, 5) if treated == 0 & year >= 2015
drop outcome_2014_control
replace trend_treated = 4 + runiform(0, 0.5) if treated == 1 & year >= 2015 // Treated accelerates further
replace outcome = base_treated + trend_treated*(year - 2005) + noise if treated == 1 & year >= 2015
* Create technology variable that increases linearly over time
gen technology = 1 + 0.2 * (year - 2006) // Increases from 1 (2006) to 3.8 (2020)
* Adjust the impact of technology on the outcome
* For control units, technology has a smaller effect (multiplied by 10)
replace outcome = outcome + 10 * technology if treated == 0
* For treated units, technology has 3 times the effect (multiplied by 30)
replace outcome = outcome + 30 * technology if treated == 1
* Drop intermediate variables
drop base_control base_treated trend_control trend_treated randnum noise
sort district year
xtset district year
gen interaction=treated*post
“
My question is: what kind of data-generating process is needed so that the coefficient on the interaction term differs when the diff-in-diff regression is run with and without time dummies?