Hi everyone, this is my first post here and I'm very new to this, so I hope my question is reasonably clear.
For my master's thesis, I want to estimate the effect that replications have on the citations of a paper. To do this, I want to compare the citations of papers that were replicated at some point with those of papers that never were.
My supervisor told me that a more appropriate model would be a staggered diff-in-diff with a Poisson regression, given the nature of my dependent variable (citations are a non-negative count). However, he told me to try an initial "simple" diff-in-diff first to see the results, even if these could be biased.
In my dataset I have around 80 papers that were replicated in different years (treatment group) and 160 that were never replicated (control group). To ensure comparability, I took only empirical papers that were published in the same journals, volumes, and issues, and that cover the same topics or share a JEL code. Here is a snippet:

So, if we take a simple DiD equation and apply it to my data, this is what I would like to estimate:
Citations_{it} = B0 + B1 * replicated_i + B2 * d_time_t + B3 * (replicated_i * d_time_t) + covariates_{it} + u_{it}
Here replicated is the treatment dummy (1 if replicated, 0 if never replicated), and d_time is the time dummy (0 = before, 1 = after).
The issue I see here is that there is no "after" for my control group, since those papers were never replicated, and I cannot borrow the "after" of a single treated paper because the treated papers have different treatment years (some years repeat, but in total there are about 15 different ones). Because of this, the way I construct my time dummy makes d_time always 0 for my control group (replicated = 0).
If I run such a model, my interaction term gets omitted because of perfect collinearity between the interaction term and d_time: since d_time can only be 1 when replicated is 1, replicated * d_time is identical to d_time. What am I missing here?
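To make the problem concrete, here is a minimal sketch in Python with a made-up toy panel (the paper ids, years, and field names are all hypothetical, not my real data). It builds d_time the way I described and checks whether the interaction column ever differs from d_time:

```python
# Toy panel: two replicated papers and two never-replicated ones.
# All ids and years are made up for illustration.
papers = [
    {"id": 1, "replicated": 1, "rep_year": 2012},
    {"id": 2, "replicated": 1, "rep_year": 2014},
    {"id": 3, "replicated": 0, "rep_year": None},
    {"id": 4, "replicated": 0, "rep_year": None},
]
years = range(2010, 2017)

rows = []
for p in papers:
    for year in years:
        # d_time is 1 only from the paper's own replication year onward,
        # so it can never be 1 for a never-replicated paper.
        d_time = int(p["rep_year"] is not None and year >= p["rep_year"])
        rows.append({
            "id": p["id"],
            "year": year,
            "replicated": p["replicated"],
            "d_time": d_time,
            "interaction": p["replicated"] * d_time,
        })

# If the two columns agree on every row, the regression cannot separate B2 from B3.
identical = all(r["interaction"] == r["d_time"] for r in rows)
control_after = sum(r["d_time"] for r in rows if r["replicated"] == 0)

print("interaction identical to d_time:", identical)
print("control observations with d_time = 1:", control_after)
```

In this toy panel the two columns agree on every row and no control observation ever has d_time = 1, which is the same pattern that leads to the omitted term in my real data.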
Now, I know this is the "basic" DiD model, which may lead to some specification issues because of the nature of my data. But I want to implement it before trying any other more advanced method. Does anybody have suggestions?
(I am trying to understand the Callaway & Sant'Anna paper on staggered DiD, but I'm having trouble understanding the implementation and interpretation of the model. From what I understand, each treated paper could act as a control paper if it was treated after another one; for example, a paper replicated in 2015 could act as a control for a paper replicated in 2010, and so on. However, I wanted to compare only NEVER-replicated papers with replicated ones, so I'm unsure whether this is the best approach.)
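To show what I mean by comparing only against never-replicated papers, here is a rough sketch in Python of a cohort-by-cohort comparison. This is only my own reading of the idea, not the estimator from the paper or from any package: the citation counts are made up, the function name att_vs_never_treated is hypothetical, and it ignores covariates and the count nature of the outcome.

```python
from statistics import mean

# Hypothetical citation counts per paper and year (made-up numbers).
citations = {
    1: {2009: 4, 2010: 5, 2011: 9},  # replicated in 2010
    2: {2009: 3, 2010: 4, 2011: 6},  # replicated in 2011
    3: {2009: 2, 2010: 3, 2011: 4},  # never replicated
    4: {2009: 4, 2010: 4, 2011: 6},  # never replicated
}
rep_year = {1: 2010, 2: 2011, 3: None, 4: None}


def att_vs_never_treated(g, t):
    """Change in citations from year g-1 to year t for papers replicated in
    year g, minus the same change for never-replicated papers only."""
    base = g - 1
    treated = [citations[i][t] - citations[i][base]
               for i, y in rep_year.items() if y == g]
    never = [citations[i][t] - citations[i][base]
             for i, y in rep_year.items() if y is None]
    return mean(treated) - mean(never)


# One comparison per (replication year, calendar year) pair.
for g, t in [(2010, 2010), (2010, 2011), (2011, 2011)]:
    print(g, t, att_vs_never_treated(g, t))
```

Note that in this sketch the paper replicated in 2011 is never used as a control for the paper replicated in 2010; each cohort is compared with papers 3 and 4 only, which is the kind of comparison I had in mind.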