Callaway and Sant'Anna Diff in Diff

Tariq Abdullah

Join Date: Apr 2021
Posts: 366

17 Jun 2021, 06:35

Hello Stata community,

I'm trying to apply the recent DiD estimator developed by Callaway and Sant'Anna to the IPUMS CPS dataset. I clean the data in Stata and run the DiD command in R (since the command is easy to run in R and I need the graph and event study). The problem is that when I run the R command I get: Error in pre_process_did(yname = yname, tname = tname, idname = idname, : The value of idname must be the unique (by tname)

Anyone familiar with CPS data will know that the CPSID variable is the panel ID that identifies a unique person, and if one uses the ASEC sample then "most probably" the same person's interview is not repeated within the same year [I could be wrong here, correct me if I am].

So the error message I'm getting in R, "The value of idname must be the unique (by tname)" [tname is year in my case], shouldn't apply to my case, since as far as I know the same person's interview is not repeated within the same year in the ASEC CPS.

So my questions to the community are :

1. As far as I know, there are no repeated observations of the same person in the same year of the CPS ASEC. So why am I getting this error message?

2. If I am wrong about the repetition of the same person in the ASEC CPS, how can I remove the repeated persons in Stata based on the unique ID (CPSID) variable?

I would highly appreciate any response! I'm attaching the R command in case anyone wants to see it. The command looks like the following; this is not my code, just a sample from the internet:

# Estimating the effect on log(homicide)
	atts <- att_gt(yname = "l_homicide", # LHS variable
	tname = "year", # time variable
	idname = "sid", # id variable
	gname = "effyear", # first treatment period variable
	data = castle, # data
	xformla = NULL, # no covariates
	#xformla = ~ l_police, # with covariates
	est_method = "dr", # "dr" is doubly robust. "ipw" is inverse probability weighting. "reg" is regression
	control_group = "nevertreated", # set the comparison group which is either "nevertreated" or "notyettreated"
	bstrap = TRUE, # if TRUE compute bootstrapped SE
	biters = 1000, # number of bootstrap iterations
	print_details = FALSE, # if TRUE, print detailed results
	clustervars = "sid", # cluster level
	panel = TRUE) # whether the data is panel or repeated cross-sectional


FernandoRios

Join Date: Apr 2014

Posts: 2411
#2

17 Jun 2021, 08:00

Hi Tariq
As with any method, the devil is in the details.
It isn't clear from your description what type of information you are using: whether you collapsed the data (as in the example) or are using some other data structure, what your dependent variable is, or what your treatment is.
Without this information, it is hard to provide much advice.
That being said, if you provide more details on your design, it may be easier to tell you why your code is not working. If it is R specific, you can always contact Pedro Sant'Anna or Brantly Callaway. They are both active and very helpful.
Also, I have a couple of posts regarding their method that could help you:
https://friosavila.github.io/playingwithstata/
You will also find there my own take on it, csdid (joint work with Pedro), which implements the method.
Best wishes
Comment
Tariq Abdullah

Join Date: Apr 2021

Posts: 366
#3

17 Jun 2021, 08:16

So, I'm trying to see how wages are affected after a law is applied in a handful of states in the USA. My dependent variable is ln wage, and idname = cpsid (which is the panel ID in IPUMS CPS data). My treated units are the few states that implemented the law, and there are states that were never treated. So I have all the variables I need to run this command.

The problem I'm having is the repeated panel ID within the same year for idname, which doesn't let me run the command in R. I need to remove the repeated observations using the CPSID variable, so that my panel ID is unique within each year.

Therefore, I'm asking for advice on how to remove the same person's repeated interview or information in the same year from my data using a Stata command (so that I don't get an error in R when I run the DiD command).

Btw, I follow your work on these recent Stata developments for the DiD literature and can't appreciate it enough on behalf of the Stata community. csdid in particular has been a huge contribution on your part. Much appreciated!
Comment
Tariq Abdullah

Join Date: Apr 2021

Posts: 366
#4

17 Jun 2021, 08:29

* Example generated by -dataex-. For more info, type help dataex
clear
input int year double(cpsid incwage)
1980 19800203031500 0
1980 19781205216100 .
1980 19800101881100 37638
1980 19800101209300 22837.545
1980 19791201557600 0
1980 19800105334500 0
1980 19790306165400 0
1980 19800201881000 27540
1980 19800104281900 .
1980 19800301990500 .
1980 19800203026200 33892.56
1980 19791200721000 13770
1980 19781202539200 0
1980 19800104281900 .
1980 19800200899800 .
1980 19790104365700 32130
1980 1.97912033e+13 0
1980 19800104205300 .
1980 19800204106800 26392.5
end
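
The listing above already hints at the problem: cpsid 19800104281900 appears twice for 1980. A minimal Stata sketch for finding and removing such repeats follows. It uses a four-row toy dataset with made-up values in place of the real extract, and the variable name dup is introduced only for illustration. One thing worth checking against the IPUMS documentation before dropping anything (an assumption, not something established in this thread) is whether cpsid identifies persons or households, since a household-level ID would legitimately repeat across household members within a year.

```stata
* Toy stand-in for the CPS extract; all values are made up for illustration
clear
input int year double(cpsid incwage)
1980 19800104281900 .
1980 19800104281900 .
1980 19800101881100 37638
1981 19800101881100 40000
end
format cpsid %15.0f

* Count how often each (cpsid, year) pair occurs
duplicates report cpsid year

* Flag repeated pairs so they can be inspected before anything is deleted
duplicates tag cpsid year, generate(dup)
list year cpsid incwage if dup > 0

* Keep one row per (cpsid, year); force is needed because the repeated
* rows may differ on other variables, and which row survives is arbitrary
duplicates drop cpsid year, force

* Stops with an error unless cpsid and year now uniquely identify rows
isid cpsid year
```

Because the surviving row is arbitrary, it is probably safer to look at the flagged rows first and decide whether they are true duplicates or different people sharing an ID.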
Comment
Tariq Abdullah

Join Date: Apr 2021

Posts: 366
#5

17 Jun 2021, 08:39

out1 <- att_gt(yname="ln_incwage",
tname="year",
idname="cpsid",
gname="laweffective_year",
xformla=~male + age + age2 + age3 + age4 + black + asian + hispanic + lths + hsdegree + somecollege ,
data=dat)

This is what my command looks like, but I get this error message: Error in pre_process_did(yname = yname, tname = tname, idname = idname, : The value of idname must be the unique (by tname)
Comment
FernandoRios

Join Date: Apr 2014

Posts: 2411
#6

17 Jun 2021, 09:17

Got it!
Ok, so I think the problem is that the default in "did" assumes your data are panel data (panel = TRUE). However, since you do not have panel data, I believe that should be set to FALSE.
Other than that, I'm not sure if idname is required when you have repeated cross sections. Perhaps you can drop it from the code.
Alternatively, look into the help file (in R) and check the example that uses the repeated cross-section estimator (I think it uses a simulated dataset).

If you use csdid, you could do the same by not adding the "ivar()" option. However, I think you may want to cluster your standard errors at the state level (cluster(state)).
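
A minimal sketch of that repeated cross-section call, assuming the community-contributed drdid and csdid packages are installed (ssc install drdid, ssc install csdid). The data are simulated only so the block is self-contained; state, male, age, the treatment timing, and the effect size are hypothetical stand-ins for the poster's variables, not anything taken from the CPS.

```stata
* Simulated repeated cross-section: each row is a different person
clear
set seed 12345
set obs 4000
gen state = ceil(runiform()*20)            // hypothetical state code, 1 to 20
gen year  = 2000 + floor(runiform()*6)     // survey years 2000 to 2005

* States 1-5 adopt in 2003, states 6-10 in 2004, the rest never (coded 0)
gen laweffective_year = cond(state <= 5, 2003, cond(state <= 10, 2004, 0))

* Covariates and an outcome with an assumed post-treatment shift of 0.05
gen male = runiform() < 0.5
gen age  = 20 + floor(runiform()*40)
gen treated = laweffective_year > 0 & year >= laweffective_year
gen ln_incwage = 10 + 0.02*age + 0.1*male + 0.05*treated + rnormal()

* No ivar(): csdid then treats the data as repeated cross sections.
* cluster(state) clusters at the level at which treatment is assigned.
csdid ln_incwage male age, time(year) gvar(laweffective_year) ///
    method(dripw) cluster(state)

* Event-study aggregation of the group-time effects
estat event
```

With real data, the simulation lines would be replaced by the cleaned extract, and the covariate list by the one used in the R call.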

Oh, don't forget that "laweffective_year" should be 0 for the never-treated units.
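
In Stata that recode is a one-liner. A minimal sketch, assuming the never-treated states are stored as missing in laweffective_year (the thread does not say how they are coded); the four-row dataset is made up for illustration.

```stata
* Toy data: two treated states and two never-treated states (stored as missing)
clear
input byte state int laweffective_year
1 2003
2 .
3 2004
4 .
end

* Code never-treated units as 0 in the first-treatment-period variable
replace laweffective_year = 0 if missing(laweffective_year)

* Check the resulting cohorts: 0, 2003 and 2004
tabulate laweffective_year
```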

Hope this helps
Comment
Tariq Abdullah

Join Date: Apr 2021

Posts: 366
#7

17 Jun 2021, 10:05

Thank you so much! That helps a lot and improves my understanding of the whole concept.

So when I use csdid in Stata, should I set ivar to null, or should I not mention ivar in the command at all?
Comment
FernandoRios

Join Date: Apr 2014

Posts: 2411
#8

17 Jun 2021, 10:09

You shouldn't use ivar at all. When you do, my command will assume you have panel data, so you may get a weird error.
You may want to use cluster(state) though, because that is the level of the treatment.
Also, if you use csdid, I still have some problems when there are "too many" treatment groups and time periods.
This is a limitation of how I programmed the results, so it may not be relevant for you, but it is worth considering.
Fernando
Comment
Tariq Abdullah

Join Date: Apr 2021

Posts: 366
#9

17 Jun 2021, 10:34

Thanks again for answering my questions so kindly and patiently! You have been a great help!
Comment
Ridwan Sheikh

Join Date: Apr 2021

Posts: 147
#10

19 Dec 2021, 00:50

I am trying to use the Callaway and Sant'Anna design. In my setting the adoption of a law across states is staggered, and I do not have a pure comparison group (never-treated). However, Callaway and Sant'Anna argue that one can use units "not yet treated" by time (t+delta) as a comparison group for the units treated at (t); this is Assumption 5 in the paper.
I am not sure how to proceed with that in Stata. Any suggestion would be helpful. I am not an R user, so everything I run is in Stata only.
Thanks
Comment
FernandoRios

Join Date: Apr 2014

Posts: 2411
#11

19 Dec 2021, 06:57

Hi Ridwan
That is easy. In the command line, just add "notyet", and that will use only not-yet-treated observations as controls.
Nevertheless, you need to observe units before they were treated to be able to use this option.
Best wishes
F
Comment
Ridwan Sheikh

Join Date: Apr 2021

Posts: 147
#12

19 Dec 2021, 08:05

Thanks FernandoRios ...!
Is this a correct way of writing the command in that case?

Code:

csdid lemp lpop, ivar(countyreal) time(year) gvar(first_treat) method (dripw) notyet saverif(rif)

Since I am learning this from your dataset, would it be possible to use "notyet" as the comparison group in your dataset and check how the coefficient estimates differ between using "never treated" and using "notyet" as the comparison group?
I have not created my own dataset yet (it may take some time), so I am learning from yours, and I may bother you again in the future. I hope you will get back to me then, when I need it most.
Thanks
Comment
FernandoRios

Join Date: Apr 2014

Posts: 2411
#13

19 Dec 2021, 09:21

Couple of thoughts.
1. notyet still includes the never-treated units as controls, unless you exclude them explicitly (if gvar!=0, for example).
2. You cannot compare both estimators automatically, but you could do it manually.
When you use "saverif" you are asking csdid to store all the information needed to compute standard errors. So you can do that for both the default option and the notyet option.
You will then need to merge both RIF files (changing variable names), make the comparisons manually, using the mean command for example, and then test for the difference.
Comment
Ridwan Sheikh

Join Date: Apr 2021

Posts: 147
#14

19 Dec 2021, 11:31

Thanks FernandoRios , your suggestion of specifying "notyet" worked successfully.
The following line of commands I ran in STATA using your dataset:

Code:

csdid lemp lpop, ivar(countyreal) time(year) gvar(first_treat) method (dripw) notyet saverif(rif)

Code:

estat cevent, window(0 2)

Code:

estat event

Code:

csdid_plot

The coefficients and their confidence bands are different, as expected, once we use "not-yet-treated" as the comparison group. However, they are not very different. Is this specific to this (your) dataset, or is it generally the case with Callaway and Sant'Anna?

I understand the second point you made above, but I am not so clear about the first. Under what circumstances should one specify (if gvar!=0)? I tried adding it to the "csdid" command above, but it produced some errors. Maybe I made a mistake in specifying it. Could you please clarify what (if gvar!=0) is for and under what conditions we should specify it?

I am sorry for asking too much.

Thanks
Comment
