Diff-in-Difference with multiple treatment periods

Laura Gonzalez Gaitan

Join Date: Jun 2024

Posts: 5
#1


21 Jul 2024, 09:43

Hi everyone, this is my first post here and I'm very new to this, so I hope it is reasonably clear.

For my master's thesis, I want to estimate the effect that replications have on the citations of a paper. To do this, I wanted to compare the citations of papers that were replicated at some point with those of papers that never were.

My supervisor told me that a more appropriate model would be a staggered diff-in-diff with a Poisson regression, given the nature of my dependent variable (citations are a non-negative count). However, he told me to try an initial "simple" diff-in-diff to see the results, even if these could be biased.

In my dataset I have around 80 papers that were replicated in different years (treatment group), and 160 that were never replicated (control group). To ensure comparability, I took only empirical papers that were published in the same journals, volumes, issues, and about the same topics or JEL code. Here is a snippet:

So, if we look at a simple DiD equation and apply it to my data, this is what I would like to estimate:

Citations_it = B0 + B1*replicated_i + B2*d_time_t + B3*(replicated_i*d_time_t) + covariates_it + u_it

Where replicated is the treatment dummy (1 if replicated, 0 if never replicated), and d_time is the time dummy (before=0 and after=1).

The issue I see here is that there is no "after" for my control group, since those papers were never replicated, and I cannot borrow the "after" of a single treated paper because the treated papers all have different treatment years (some repeat, but in general there are 15 different years). Because of this, the way I construct my time dummy makes d_time always 0 for my control group (replicated=0).

If I run such a model, my interaction term gets omitted because of perfect collinearity between the interaction term and d_time. What am I missing here?
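
The collinearity can be checked directly. This is only a sketch: it assumes d_time has been built as described above (0 in every observation of a never-replicated paper), and inter is a hypothetical name for the interaction.

Code:

* d_time is 0 for every never-replicated paper, so the interaction
* replicated*d_time takes the same value as d_time in every observation
gen byte inter = replicated * d_time
assert inter == d_time
* the cell replicated==0 & d_time==1 should be empty
tabulate replicated d_time

If the assert passes, the interaction and d_time are the same variable, and one of them has to be dropped.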

Now, I know this is the "basic" DiD model, which may lead to some specification issues because of the nature of my data. But I want to implement it before trying any other more advanced method. Does anybody have suggestions?

(I am trying to understand the Callaway & Sant'Anna paper on the staggered approach, but I'm having trouble understanding the implementation and interpretation of the model. From what I understand, each treated paper could act as a control paper if it was treated after another one (for example, a paper replicated in 2015 could act as a control for a paper replicated in 2010, and so on). However, I wanted to compare only NEVER-replicated papers with replicated ones, so I'm unsure whether this is the best approach.)
Tags: difference-in-difference, interaction, panel data
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#2

21 Jul 2024, 10:24

But I want to implement it before trying any other more advanced method. Does anybody have suggestions?

It is mathematically impossible to use the classic DID model with this data and my suggestion is that you stop trying. You need a generalized DID model. In that model there is a key variable, which for your situation might be aptly named has_been_replicated taking on the value 0 if the paper has not yet ever been replicated up to the year of the observation, and 1 if it has been replicated in or before the year of the observation. You can create that easily with -gen byte has_been_replicated = year >= rep_year-. The model is then:

Code:

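* If rep_year is missing for never-replicated papers, year >= rep_year is
* false for them (Stata treats missing as larger than any number), so they get 0
gen byte has_been_replicated = year >= rep_year
xtset paper_id year // ASSUMES paper_id IS A VALUE LABELED NUMERIC VARIABLE*
xtreg citations i.has_been_replicated i.year, fe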

The model equation corresponding to this is

citations_it = b0 + b1*has_been_replicated_it + b_t*year_t + u_i + e_it

The coefficient of has_been_replicated is then the (generalized) DID estimator of the effect of replication. Note: I have omitted covariates from the model for simplicity of exposition, but they can be added in as needed. Bear in mind that because both year and publication fixed effects appear in the model, there is no role for any covariate that is fixed over time for each publication (e.g. number of authors) or fixed across publications for each year (e.g. annual revenues of the publication industry).

I think this is what your supervisor actually had in mind when he suggested a simple model. This is a simple model in that it is linear rather than Poisson. I also think the Poisson model makes more sense, not just because you have a count outcome variable, but because the effect of replication on the number of cites seems likely to be multiplicative rather than additive. (Caveat: bibliographic studies is not my area of expertise and I'm just going on lay intuition here.)
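
For reference, a fixed-effects Poisson analogue of the linear model above might look like the following. It is a minimal sketch, assuming the same variables as in the code above and that the data are already in paper-year long layout; it is not necessarily the specification your supervisor has in mind.

Code:

* Fixed-effects Poisson version of the generalized DID (a sketch)
xtset paper_id year
xtpoisson citations i.has_been_replicated i.year, fe vce(robust)
* exp(b) - 1 is the proportional change in expected citations after replication
display exp(_b[1.has_been_replicated]) - 1

Here the coefficient of has_been_replicated is on the log scale, so it describes a multiplicative effect on expected citations rather than an additive one.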

*If paper_id is in fact a string variable, you will need to use -encode- to make it a value labeled numeric variable: -xtset- will not accept a string variable. From the type of display of the data you show, it is impossible to tell which is the case here. In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
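
A short sketch of both steps, assuming paper_id is a string variable (paper_num is a hypothetical name for the new numeric variable, and the variable list for -dataex- is simply the one used in this thread):

Code:

* Only needed if paper_id is a string variable
encode paper_id, generate(paper_num)
xtset paper_num year
* Show a small data example for the forum
dataex paper_id year citations rep_year in 1/20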
George Ford

Join Date: Aug 2014

Posts: 3120
#3

22 Jul 2024, 08:15

Think about an event-style analysis in long differences. Center the treatment dates on the replication year and look at citations 5 or 10 years after replication.
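
One way the centering and a 5-year long difference could be set up is sketched below. It assumes the rep_year and year variables from #2, with rep_year missing for never-replicated papers. The names rel_year, cites_pre, cites_post5 and long_diff5 are hypothetical, and the year before replication is just one possible baseline.

Code:

* Event time relative to replication; missing for never-replicated papers
gen rel_year = year - rep_year
* Citations in the year before replication and 5 years after, by paper
bysort paper_id: egen cites_pre = max(cond(rel_year == -1, citations, .))
bysort paper_id: egen cites_post5 = max(cond(rel_year == 5, citations, .))
* Long difference; missing if either year is not observed for the paper
gen long_diff5 = cites_post5 - cites_pre

This only constructs the long difference for the replicated papers; how to form the comparison for the never-replicated ones is a separate choice.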

Jeff Wooldridge

Join Date: Apr 2014
Posts: 2121

23 Jul 2024, 06:54

This looks like a job for jwdid or csdid. You have to specify the first treatment time with zero for the never treated.

Code:

* 0 for never-replicated papers, otherwise the year of replication
* (assumes rep_year is missing for never-replicated papers)
gen first_treat = 0
replace first_treat = rep_year if !missing(rep_year)
jwdid citations x, ivar(paper_id) tvar(year) gvar(first_treat)
estat event
estat plot
csdid citations x, ivar(paper_id)  time(year) gvar(first_treat) method(reg) long2
estat event
csdid_plot
