  • Data generation that satisfies different coefficients for interaction term in diff-in-diff regression with and without time dummies?

    Hello all,

    I have tried many times, without success, to construct a panel dataset in which adding time dummies to the regression changes the interaction term's coefficient. To elaborate, imagine a panel dataset with the following variables: outcome, year, unit_id (district in the code below), a treated dummy, a post dummy, and an interaction (treated × post) dummy.
    The problem is that this regression:
    Outcome = α + β interaction + λ treated + γ post + e
    this regression:
    Outcome = α + β interaction + λ treated + γ post + i.year + e
    and this one:
    Outcome = α + β interaction + λ treated + γ post + i.year + i.district + e
    all yield the same β coefficient.
    I tried several data-generating scripts in Stata, including one that defines a technology variable that changes over time and affects the outcome differently for control and treated units. The most recent one is pasted here as an example:

    set seed 12345
    local n_districts = 50
    local n_years = 15
    set obs `= `n_districts' * `n_years''
    * Fill in district and year for each observation (15 consecutive rows per district)
    gen district = ceil(_n / `n_years')
    gen year = 2006 + mod(_n - 1, `n_years')

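    gen treated = 0
    * Treat 20 out of 50 districts. Assignment is by district id (1-20), so it is deterministic;
    * sorting on randnum only shuffles the row order and does not drive the assignment
    gen randnum = runiform()
    sort randnum
    replace treated = 1 if district <= 20 /* First 20 districts are treated */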

    gen post = (year >= 2015)
    gen outcome = .

    * Pre-treatment period (2006-2014): parallel increasing trends on average, with treated having a higher baseline
    * Note: runiform() is drawn per observation, so these terms vary within district rather than being district-level constants
    gen base_control = 50 + runiform(0, 5) // Baseline outcome level for control
    gen base_treated = 65 + runiform(0, 5) // Higher baseline for treated units

    gen trend_control = 2 + runiform(0, 0.5) // Trend for control units
    gen trend_treated = 2 + runiform(0, 0.5) // Parallel trend for treated units

    * Add random noise
    gen noise = rnormal(0, 10) // Increased noise with a standard deviation of 10

    replace outcome = base_control + trend_control*(year - 2005) + noise if treated == 0 & year < 2015
    replace outcome = base_treated + trend_treated*(year - 2005) + noise if treated == 1 & year < 2015

    bysort district: egen outcome_2014_control = mean(cond(year == 2014 & treated == 0, outcome, .))
    replace outcome = outcome_2014_control + rnormal(0, 5) if treated == 0 & year >= 2015
    drop outcome_2014_control

    replace trend_treated = 4 + runiform(0, 0.5) if treated == 1 & year >= 2015 // Treated accelerates further
    replace outcome = base_treated + trend_treated*(year - 2005) + noise if treated == 1 & year >= 2015

    * Create technology variable that changes over time and follows the same 15-year path in every district
    gen technology = 1 + 0.2 * (year - 2006) // A simple increasing technology variable from 1 (2006) to 3.8 (2020)

    * Adjust the impact of technology on the outcome
    * For control units, technology has a smaller effect (multiplied by 10)
    replace outcome = outcome + 10 * technology if treated == 0

    * For treated units, technology has 3 times the effect (multiplied by 30)
    replace outcome = outcome + 30 * technology if treated == 1

    * Drop intermediate variables
    drop base_control base_treated trend_control trend_treated randnum noise

    sort district year

    xtset district year

    gen interaction=treated*post

    My question is: What kind of data-generating process is needed to get different coefficients on the interaction term when running the diff-in-diff regression with and without time dummies?

  • #2
    treated is collinear with i.district, and post is collinear with i.year, so post already captures the part of i.year that matters for the interaction. In fact, treated and post are estimated only because of the order in which you list the variables: Stata drops one of the i.district or i.year dummies instead, so each reported coefficient simply stands in for a dropped dummy. Use reghdfe instead and you will see that they are not estimated.
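
    One way to see why β does not move is the Frisch-Waugh-Lovell theorem. The following is only a sketch, assuming a balanced panel in which all treated units share the same treatment date, as in the simulated data in #1. The notation is introduced here for illustration: T_i is treated, P_t is post, D_it is the interaction, and bars denote sample means.

    ```latex
    D_{it} = T_i P_t, \qquad
    \tilde{D}_{it} = D_{it} - T_i \bar{P} - \bar{T} P_t + \bar{T}\bar{P}
                   = (T_i - \bar{T})(P_t - \bar{P}), \qquad
    \hat{\beta} = \frac{\sum_{i,t} \tilde{D}_{it} \, Y_{it}}{\sum_{i,t} \tilde{D}_{it}^{2}}
    ```

    Under these assumptions, residualizing the interaction on (constant, treated, post), on (treated, post, i.year), or on (treated, post, i.year, i.district) gives the same residual, because the projection of T_i P_t on the year dummies is a function of P_t only, and the projection on the district dummies is a function of T_i only. The same residual implies the same β in all three regressions.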

    If you want the DID coefficient to be different with i.year included, leave post out of the first regression so that you do not account for time.
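
    The comparison described above could be sketched as follows. This assumes the simulated panel from #1 is in memory with the variable names used there. reghdfe is a community-contributed command, so it has to be installed first. No output is shown, since the estimates depend on the simulated draws.

    ```stata
    * Sketch only: assumes outcome, treated, post, interaction, district, year exist

    * With post in the baseline, the interaction coefficient is the same in all three
    regress outcome interaction treated post
    regress outcome interaction treated post i.year
    regress outcome interaction treated post i.year i.district

    * Without post, the baseline does not account for time,
    * so adding i.year should generally change the interaction coefficient
    regress outcome interaction treated
    regress outcome interaction treated i.year

    * reghdfe (install once: ssc install ftools, then ssc install reghdfe)
    * treated and post are expected to be omitted as collinear with the absorbed effects
    reghdfe outcome interaction treated post, absorb(district year)
    ```

    In the regression without post, the interaction coefficient picks up the treated group's own pre/post change rather than the difference-in-differences, which is why it differs once i.year is added.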
