Dear all,
I am trying to replicate the well-known paper by Callaway and Sant'Anna (2021), "Difference-in-Differences with multiple time periods" (https://doi.org/10.1016/j.jeconom.2020.12.001). Rather surprisingly, I haven't been able to find any do-file that does this in Stata, not even on the authors' personal pages. I had a go at it using Fernando Ríos-Avila's very useful materials, specifically Playing with Stata (friosavila.github.io). The code that attempts to replicate Table 3 in the paper (arguably the main table) is copied below. Note that the original data can be found at
https://github.com/pedrohcgs/CS_RR
where it is stored as an "rds" file. I am attaching a CSV version of that file (with some variables dropped so that it can be uploaded to the Statalist forum), which is the file used in the code below.
Panel A in Table 3 is mostly replicated: csdid without any controls reproduces rows 2, 3, 4, and 5 of the paper, where the treatment effects are aggregated in different ways. Fine. Still, two questions remain:
i) Where does the coefficient in row 1 of the paper, TWFE, come from? The paper says, "... we first estimate the coefficient on a post-treatment dummy variable in a model with unit fixed effects and region-year fixed effects...". The command under "Row 1" in my code gives 0.0177, but the paper reports −0.037. Any idea what the correct specification is?
ii) Does csdid allow us to obtain the last row (Row 6: Event study w/ Balanced groups) automatically? Of course, this can be done manually, but I am wondering whether it has been automated.
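On question i), one possibility I have not ruled out is that the dummy w in my code switches on in 2004 for every ever-treated county, even though treatment is staggered across counties. A minimal sketch of a dummy based on each county's own first_treat follows. The name w_stag is hypothetical, the sketch assumes first_treat is 0 (or missing) for never-treated counties, and I cannot say whether it reproduces the paper's −0.037.

```stata
* Sketch only: w_stag is a hypothetical name, not taken from the paper or from csdid.
* Assumes first_treat is 0 (or missing) for never-treated counties,
* and that the data have already been xtset as in the code for Table 3.
gen byte w_stag = first_treat > 0 & !missing(first_treat) & year >= first_treat

* Same TWFE regression as under "Row 1", with the staggered dummy instead of w
xtreg lemp w_stag i.region_year, fe vce(cluster countyreal)
```

Comparing this coefficient with the one on w would at least show how sensitive the TWFE estimate is to the way the post-treatment dummy is defined.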
Panel B is only partly replicated: rows 2 and 3 match, but the rest do not. Of course, this comes down to how I have interpreted the model in the paper, using the variables from Table 2. Importantly, the paper says "... We use the doubly robust estimation procedure discussed above. [...] For each generalized propensity score, we estimate a logit model that includes each county characteristic along with quadratic terms for population and median income. For the outcome regressions, we use the same specification for the covariates".
i) My understanding is that doubly robust methods typically allow one to specify an outcome model and a treatment model separately (see e.g. teffects aipw). But csdid does not allow such decoupling: the model is the same for both. This, in turn, makes it impossible to follow what is stated in the original paper, where two different models are defined. Why is this decoupling not allowed in this case? Is this what is driving the divergent results? I checked drdid, and it does not allow such decoupling either. How, then, can the specification described in the paper be implemented?
ii) What is the specification that gives row 1 (TWFE) in this case with controls? I get 0.0165, but the paper reports −0.008.
Any insight into this will be greatly appreciated, and hopefully, it will also help those who are trying to replicate the paper!
Many thanks in advance
JM
I am using Stata 17.0
PS: if the attachment does not work, you can open R and run this bit of code after you download the data from https://github.com/pedrohcgs/CS_RR
ls()
rm(list = ls())
getwd()
setwd('PERSONALFOLDER')
min_wage <- readRDS('min_wage_CS.rds')
write.csv(as.matrix(min_wage), file = "min_wage_CS.csv")
The file uploaded here, min_wage_CS_reduced, drops unnecessary variables from the original dataset.
Code:
import delimited "min_wage_CS_reduced.csv", clear case(lower)

/* treat is the treatment qualifier: 1 if treated at any point, 0 otherwise
   countyreal is a decode of county_name in the original data */
rename firsttreat first_treat
gen post_treatm = inlist(year, 2004, 2005, 2006, 2007)
gen w = post_treatm*treat
egen region_year = group(region year)
sort countyreal year
xtset countyreal year, yearly

* Table 3
** Panel A
*** Row 1: TWFE
xtreg lemp w i.region_year, fe vce(cluster countyreal)

preserve
csdid lemp, ivar(countyreal) time(year) gvar(first_treat) ///
    agg(event) saverif(results_unconditional) replace
estat pretrend
use results_unconditional, clear
*** Row 2
csdid_stats simple
*** Row 3: Group-specific effects
csdid_stats group
*** Row 4: Event Study
csdid_stats event
*** Row 5: Calendar time effects
csdid_stats calendar
*** Row 6: Event study e=0 e=1 w/ Balanced groups
*?
restore

** Panel B
*** Row 1: TWFE
local controls i.region c.white c.hs c.pov c.pop##c.pop c.medinc##c.medinc
xtreg lemp w i.region_year (`controls')##i.year, fe vce(cluster countyreal)

preserve
csdid lemp i.region white hs pov c.pop##c.pop c.medinc##c.medinc, ///
    ivar(countyreal) time(year) gvar(first_treat) method(drimp) ///
    agg(event) saverif(results_conditional) replace
estat pretrend
use results_conditional, clear
*** Row 2
csdid_stats simple
*** Row 3: Group-specific effects
csdid_stats group
*** Row 4: Event Study
csdid_stats event
*** Row 5: Calendar time effects
csdid_stats calendar
*** Row 6: Event study e=0 e=1 w/ Balanced groups
*?
restore