Trouble implementing csdid with repeated cross-section data and individual-level treatment

Anxo Ferreiro

Join Date: Aug 2022
Posts: 5

30 Aug 2022, 06:32

Hello Statalist,

I am doing an evaluation of a policy in the United States, by studying its effect on voting behavior at the individual level. My outcome variable is thus an indicator that takes the value of 1 if that individual voted in that year's election, 0 otherwise.

My data are a series of repeated cross-sections, spaced two years apart between 2004 and 2020. The policy started in 2012, and a complicating factor is that treatment is defined at the individual level. Individuals are considered "treated" (I am working towards ITT estimates) if they meet certain income criteria.

My goal is to implement the semiparametric DiD estimator proposed by Abadie (2005, Review of Economic Studies), applied to repeated cross-section data and generalized to several periods. For this purpose, I turn to the - csdid - package.

I am able to implement a 2x2 design using - drdid - with the following command (post takes the value of 1 if year >= 2012):

Code:

drdid votvar female black age agesq married yrseduc nchild state_unempld [weight = weightvar], time(post) tr(treated) all cluster(stateid)

To generalize this to several periods, I turned to the csdid command, but I am having trouble with the gvar argument. I was initially using the following command:

Code:

csdid votvar female black age agesq married yrseduc nchild state_unempld [weight = weightvar], time(year) gvar(first_treat) ipw cluster(stateid)

but I honestly am not sure what kind of variable I should indicate in the gvar argument. I first defined first_treat as a variable that takes the value of the year if an individual is treated in that year, but I think this is wrong.

I may not even be correctly implementing this specification. I suspect the individual-level treatment indicator is problematic in this context, but I am somewhat inexperienced with DID with repeated cross-section so I am not sure.

Am I even correct in attempting this? If so, how should I define the gvar variable? I have read through csdid's documentation and I know it is meant to indicate the period in which an observation is first treated, but given that I am using repeated cross-sections, I am not sure how this is relevant because each individual is only observed during one period.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int year byte(post votvar) float treated byte stateid double weightvar byte(female black age) int agesq byte(married yrseduc nchild) float state_unempld
2004 0 1 1 1  2623.443 1 0 38 1444 1 14 2 .068462305
2004 0 . 1 1 2722.3819 1 0 32 1024 1 12 2 .068462305
2004 0 1 1 1 2429.3012 0 0 36 1296 1 16 1 .068462305
2004 0 1 1 1 2749.4369 0 0 41 1681 1 16 2 .068462305
2004 0 1 1 1  2623.443 1 0 35 1225 1 16 1 .068462305
2004 0 1 0 1  2469.438 1 0 43 1849 1 16 1 .068462305
2004 0 1 0 1 2661.5829 0 0 51 2601 1 13 1 .068462305
2006 0 0 1 1  2324.241 1 0 33 1089 1 12 2 .064763226
2006 0 1 1 1 3084.0646 1 0 40 1600 1 13 2 .064763226
2006 0 1 1 1 3176.7826 1 0 45 2025 1 16 2 .064763226
2006 0 1 1 1 2802.3992 0 1 38 1444 1 16 2 .064763226
2006 0 0 1 1 2577.4423 1 0 25  625 1 12 1 .064763226
2006 0 1 1 1 3593.7706 0 0 48 2304 1 16 2 .064763226
2006 0 0 1 1 2661.4183 0 0 34 1156 1 12 1 .064763226
2006 0 1 1 1 2480.7861 1 1 38 1444 1 13 2 .064763226
2006 0 . 1 1 3176.7826 1 0 47 2209 0 12 1 .064763226
2006 0 1 0 1 2410.8311 1 0 35 1225 0 12 1 .064763226
2006 0 0 1 1 2428.1492 0 0 34 1156 1 13 2 .064763226
2008 0 1 1 1 3323.5928 0 0 61 3721 1 16 2 .068048485
2008 0 0 0 1 3650.3775 0 0 49 2401 1 13 1 .068048485
2008 0 1 1 1 2989.0131 1 0 47 2209 1 17 2 .068048485
2008 0 1 0 1 2989.0131 1 0 46 2116 1 12 1 .068048485
2008 0 1 1 1 3671.8115 0 0 39 1521 1 14 1 .068048485
2008 0 1 1 1 3892.0033 1 0 37 1369 1 16 2 .068048485
2008 0 1 1 1  2782.015 0 0 35 1225 1 13 2 .068048485
2008 0 0 1 1 3273.7527 0 0 41 1681 1 13 2 .068048485
2008 0 1 1 1 3156.2368 1 0 31  961 1 16 1 .068048485
2008 0 . 1 1 3026.6301 1 0 25  625 1 12 2 .068048485
2008 0 . 1 1 3754.9383 0 0 26  676 1 10 2 .068048485
2008 0 1 1 1 2853.1402 1 0 39 1521 1 16 2 .068048485
2010 0 . 0 1 2637.3475 0 0 45 2025 1 17 1  .11254114
2010 0 1 0 1 2925.8712 1 0 45 2025 1 13 1  .11254114
2010 0 . 0 1  2581.067 1 0 49 2401 1 16 1  .11254114
2010 0 . 1 1 2813.3854 1 0 37 1369 0  5 3  .11254114
2010 0 0 0 1 3418.3679 0 0 49 2401 1 17 1  .11254114
2010 0 . 0 1 2577.1412 1 0 42 1764 1 13 1  .11254114
2010 0 . 0 1 2552.9962 1 0 34 1156 1 13 1  .11254114
2010 0 . 0 1 3418.3679 0 0 47 2209 1 12 1  .11254114
2010 0 . 0 1 2116.2538 0 0 45 2025 1 11 1  .11254114
2012 1 1 0 1 3519.0209 0 0 46 2116 1 17 2  .09638353
2012 1 1 0 1 2817.1964 1 0 42 1764 1 16 2  .09638353
2012 1 0 0 1 3143.4283 0 0 41 1681 0 12 2  .09638353
2012 1 0 0 1 3206.4962 0 0 43 1849 0 13 2  .09638353
2012 1 1 0 1 3584.0283 0 0 42 1764 1 12 1  .09638353
2012 1 1 1 1 3677.8484 1 0 34 1156 1 12 2  .09638353
2012 1 0 0 1 2865.6223 1 0 36 1296 0 12 2  .09638353
2012 1 1 0 1 2851.6203 1 0 34 1156 1 16 2  .09638353
2012 1 1 0 1 3626.2316 1 0 47 2209 1 13 1  .09638353
2012 1 1 0 1 2827.5482 0 0 41 1681 1 16 1  .09638353
2012 1 1 0 1 3269.7352 0 0 35 1225 1 16 2  .09638353
2012 1 1 0 1 2918.4123 1 0 47 2209 1 13 2  .09638353
2012 1 1 1 1 3324.0039 0 0 36 1296 1 13 2  .09638353
2012 1 0 0 1 3519.0209 0 0 45 2025 0 12 2  .09638353
2012 1 1 0 1  3833.862 0 0 49 2401 1 17 2  .09638353
2012 1 1 0 1  3447.875 1 0 28  784 1 13 1  .09638353
2012 1 0 0 1 2725.3815 1 0 51 2601 1 17 3  .09638353
2014 1 . 1 1 2188.7591 1 0 51 2601 1 13 2  .08329498
2014 1 0 0 1   1492.99 0 0 60 3600 1 20 1  .08329498
2014 1 . 1 1 3370.5631 0 0 55 3025 1 13 2  .08329498
2014 1 1 0 1 1670.9967 0 0 52 2704 1 17 1  .08329498
2014 1 0 0 1 1831.7712 0 1 39 1521 1 16 1  .08329498
2014 1 0 0 1 1532.8349 1 0 45 2025 1 16 1  .08329498
2014 1 1 1 1 1882.3686 1 1 36 1296 0 10 3  .08329498
2014 1 1 0 1 2316.1321 0 0 34 1156 0 13 1  .08329498
2014 1 0 1 1 1832.7644 1 0 46 2116 0 16 2  .08329498
2014 1 . 0 1 1412.5225 1 0 37 1369 1 13 2  .08329498
2014 1 1 0 1 1861.0222 1 0 42 1764 1 16 2  .08329498
2014 1 0 0 1 1417.4637 0 0 40 1600 1 13 2  .08329498
2014 1 1 0 1 2060.7882 0 0 44 1936 1 17 2  .08329498
2014 1 1 0 1 1752.6919 1 0 52 2704 1  9 1  .08329498
2014 1 0 0 1 2438.8565 0 0 53 2809 1 13 2  .08329498
2014 1 1 0 1 1699.9555 1 1 37 1369 1 16 1  .08329498
2016 1 1 1 1 1667.8133 0 0 43 1849 1 12 2  .06175678
2016 1 1 0 1 1447.7026 1 0 52 2704 0 16 1  .06175678
2016 1 1 1 1 1320.0783 1 0 36 1296 1 17 2  .06175678
2016 1 1 0 1 1959.0251 0 0 31  961 0 16 1  .06175678
2016 1 1 0 1 2050.7085 1 0 32 1024 1 13 1  .06175678
2016 1 0 0 1 1623.5081 1 0 45 2025 0 16 2  .06175678
2016 1 1 1 1  2284.904 1 0 38 1444 1 13 2  .06175678
2016 1 0 0 1 1373.7601 1 0 33 1089 1 12 2  .06175678
2016 1 1 1 1 1334.1652 1 0 44 1936 0 16 1  .06175678
2016 1 1 1 1 1590.1385 0 0 39 1521 1 17 2  .06175678
2016 1 1 0 1  1603.083 0 0 35 1225 1 12 2  .06175678
2016 1 1 0 1 1690.5178 0 0 34 1156 1 13 1  .06175678
2016 1 0 0 1 1320.0783 1 0 37 1369 0 11 1  .06175678
2018 1 0 0 1 1889.1662 0 0 46 2116 1 12 2  .05406519
2018 1 1 0 1 1824.2471 1 0 30  900 1 14 2  .05406519
2018 1 1 0 1 1682.9579 0 0 40 1600 1 16 2  .05406519
2018 1 1 0 1  1677.085 0 0 32 1024 1 13 2  .05406519
2018 1 1 0 1 1889.1662 0 0 49 2401 1 13 2  .05406519
2018 1 1 1 1 1656.8716 0 0 58 3364 1 17 2  .05406519
2018 1 1 0 1 1828.8339 1 0 44 1936 1 16 2  .05406519
2018 1 . 0 1 1910.1095 1 0 29  841 1 16 2  .05406519
2018 1 1 1 1 1978.0334 1 0 47 2209 1 16 2  .05406519
2018 1 0 0 1 1732.7991 1 0 35 1225 1 16 1  .05406519
2018 1 0 0 1 1600.3391 0 0 45 2025 1 14 1  .05406519
2018 1 0 1 1 1716.7169 1 0 39 1521 0 12 3  .05406519
2018 1 0 0 1 1657.2191 1 0 45 2025 1 13 2  .05406519
2018 1 . 0 1 1543.8662 0 0 31  961 1 16 2  .05406519
2018 1 1 0 1 1896.5009 1 0 45 2025 1 16 2  .05406519
end
label values age AGE
label def AGE 25 "25", modify
label def AGE 26 "26", modify
label def AGE 28 "28", modify
label def AGE 29 "29", modify
label def AGE 30 "30", modify
label def AGE 31 "31", modify
label def AGE 32 "32", modify
label def AGE 33 "33", modify
label def AGE 34 "34", modify
label def AGE 35 "35", modify
label def AGE 36 "36", modify
label def AGE 37 "37", modify
label def AGE 38 "38", modify
label def AGE 39 "39", modify
label def AGE 40 "40", modify
label def AGE 41 "41", modify
label def AGE 42 "42", modify
label def AGE 43 "43", modify
label def AGE 44 "44", modify
label def AGE 45 "45", modify
label def AGE 46 "46", modify
label def AGE 47 "47", modify
label def AGE 48 "48", modify
label def AGE 49 "49", modify
label def AGE 51 "51", modify
label def AGE 52 "52", modify
label def AGE 53 "53", modify
label def AGE 55 "55", modify
label def AGE 58 "58", modify
label def AGE 60 "60", modify
label def AGE 61 "61", modify
label values nchild NCHILD
label def NCHILD 1 "1 child present", modify
label def NCHILD 2 "2", modify
label def NCHILD 3 "3", modify

I am using Stata 17 SE on a Windows PC.

Thank you.

Last edited by Anxo Ferreiro; 30 Aug 2022, 07:04.

FernandoRios

Join Date: Apr 2014

Posts: 2469
#2

30 Aug 2022, 07:37

Hi Anxo
The definition of gvar is correct, but it is less clear how to use it with repeated cross-section data: because you do not see the same individuals across time, you do not know what the right "timing" is for each group.

So, if I understand correctly, you can only see if an individual is treated or not (has enough income to be considered treated or not) but do not know "when" that individual met that income level.
Unfortunately, if you do not have this piece of information, you cannot account for timing and cohort heterogeneity.

Perhaps if you provide me with more information on the problem, I may be able to provide better feedback.
Fernando
Anxo Ferreiro

Join Date: Aug 2022

Posts: 5
#3

30 Aug 2022, 07:48

Hi Fernando, thanks so much for your reply.

In my data, there is information on income for each year, so I build the treatment indicator based on that variable. So, for 2008, there is an income variable corresponding to that year with values for each individual. I then generate the indicator for treatment based on this variable. The same applies for 2010, 2012 and so on. This means each individual is treated or not at each year.

Therefore, if I generate the gvar as I described it in my original post, for the 2010 year, all individuals that are treated (based on their 2010 income) will have a gvar value of 2010, and the untreated ones will have a value of 0. In 2012, all treated individuals will have a value of 2012 and so on. I think this follows the definition of gvar, but still does not solve the problem. This is the error that I receive when I specify gvar like this:

Code:

(importance weights assumed)
No never treated observations found. Using Not yet treated data
Units always treated found. These will be excluded
All observations require at least 1 not treated period
See cross table below, and verify All Gvar have at least 1 not treated period

           |                 first_treat
      year |      2008       2012       2016       2020 |     Total
-----------+--------------------------------------------+----------
      2008 |     3,725          0          0          0 |     3,725
      2012 |         0      3,138          0          0 |     3,138
      2016 |         0          0      2,507          0 |     2,507
      2020 |         0          0          0      1,736 |     1,736
-----------+--------------------------------------------+----------
     Total |     3,725      3,138      2,507      1,736 |    11,106
--Break--
r(1);

(In the output above, I have used data spaced four years apart, hence the jumps from 2008 to 2012, but the same would apply if I were using data spaced two years apart.)
Let me know if this information is helpful.

Thanks so much for your help.
FernandoRios

Join Date: Apr 2014

Posts: 2469
#4

30 Aug 2022, 08:08

Hi Anxo
There is no problem with the time gaps. It's more about the gvar construction.
For example, in the year 2008, you can identify individuals treated in that year. And you can probably also identify individuals that were never treated (controls).
However, you should still be able to identify the groups of individuals who would potentially be treated in all the other years as well: 2012, 2016, 2020.

In other words, consider only year 2008. Can you identify in this sample who would potentially be treated in 2012, 2016 or 2020?
Anxo Ferreiro

Join Date: Aug 2022

Posts: 5
#5

30 Aug 2022, 08:10

Hi Fernando,

Unfortunately not. Within each year's cross section I can only identify if individuals are treated that year, but not in earlier or later years.

I suppose this renders my identification strategy useless.

Would the - drdid - specification still be valid?
FernandoRios

Join Date: Apr 2014

Posts: 2469
#6

30 Aug 2022, 08:58

Only if the treated and control groups are correctly identified.
Meaning, you are assuming that income doesn't change across time, so if a unit is "treated", it was always treated.
Now, I'm more curious about the definition of pre and post. How is that identified?
Anxo Ferreiro

Join Date: Aug 2022

Posts: 5
#7

30 Aug 2022, 09:04

Pre and post are defined as before or after 2012. The program I am evaluating began in 2012. Income-eligible individuals on or after this period would be treated. Income-ineligible individuals would not be treated. Before 2012, neither of the groups would receive treatment.

Thanks so much for your help.
FernandoRios

Join Date: Apr 2014

Posts: 2469
#8

30 Aug 2022, 09:08

Ohhh, OK, that makes much more sense now.
In that case, set gvar=0 if they are not income eligible, and gvar=2012 if they are income eligible.
Then you can run csdid or drdid.
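
A minimal sketch of that recoding, under stated assumptions: the data below are simulated placeholders (not the real survey), the variable names follow the -dataex- example in #1, and treated is taken to be the income-eligibility indicator in every survey year, including those before 2012. The estimator is requested with method(ipw), the form of the option used in #15; covariates are cut down to age for brevity.

```stata
* Toy data standing in for the real survey (all values simulated).
clear
set seed 12345
set obs 1800
generate int  year    = 2004 + 2*mod(_n, 9)       // 2004, 2006, ..., 2020
generate byte stateid = 1 + mod(_n, 20)
generate byte age     = 25 + floor(36*runiform())
generate byte treated = runiform() < 0.4           // income-eligible indicator
generate byte votvar  = runiform() < 0.6

* Eligible respondents form the 2012 cohort in every survey year;
* ineligible respondents are coded 0 (never treated).
generate int first_treat = cond(treated == 1, 2012, 0) if !missing(treated)

* Both cohorts should now appear in every year, before and after 2012.
tabulate year first_treat

* Requires the community-contributed csdid and drdid (ssc install).
csdid votvar age, time(year) gvar(first_treat) method(ipw) cluster(stateid)
```

The cross-tabulation is the check that matters: unlike the diagonal table in #3, each value of first_treat should have observations in years both before and after 2012.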
Anxo Ferreiro

Join Date: Aug 2022

Posts: 5
#9

31 Aug 2022, 14:32

Hi Fernando,

Thanks so much for that. It worked. Just to make sure: does this implementation (the csdid one), where I specify gvar = 2012 for income-eligible individuals in years earlier than 2012, still imply the assumption that, if an individual is treated in, say, 2008, their treatment status would be the same in the following years?
FernandoRios

Join Date: Apr 2014

Posts: 2469
#10

31 Aug 2022, 14:53

Based on your description, individuals were only treated in 2012. So it doesn't matter when the data is collected: your assumption is that income-eligible individuals were treated in 2012 (if the data is collected after 2012) or will be treated in 2012 (if the data is collected before 2012).
Berenice Hernandez

Join Date: Jan 2023

Posts: 2
#11

30 Jan 2023, 17:11

Hi all, I have a somewhat similar problem to Anxo's. Would you help me out?

I have 6 repeated cross sections of different men and women, one cross section for each bimester in one year. The program that I am evaluating starts in the 4th bimester and affects only women. According to what I understood, I should define gvar=4 for women in the 1st, 2nd, 3rd and 4th bimesters, and then gvar=5 and gvar=6 for women in the 5th and 6th bimesters, respectively. But I get nothing. I would really appreciate the help!

By the way, I also considered defining treat as 4 for every woman in every bimester, but that didn't work either.

Last edited by Berenice Hernandez; 30 Jan 2023, 17:43.
FernandoRios

Join Date: Apr 2014

Posts: 2469
#12

30 Jan 2023, 17:31

First, some troubleshooting. Can you run
tab bm treat5
Also, I suspect that you can't use bank ID dummies (too many dummies).
Berenice Hernandez

Join Date: Jan 2023

Posts: 2
#13

31 Jan 2023, 10:47

Sure. Just a little disclaimer: I have only 5 bimesters and not 6 (sorry); this follows from the way I defined them. The tab looks like this:

It may also help to know that only observations where female==1 are assigned the values 4 and 5 for the treat5 variable.

As for bank_id, I have 14 different banks, meaning 13 dummies, but my data set consists of 3.5 million observations. Any thoughts or rules of thumb on adding controls are welcome.

Thank you for your help!
FernandoRios

Join Date: Apr 2014

Posts: 2469
#14

31 Jan 2023, 12:00

OK, so 3 points:
1. You cannot use gender as a control, because it defines treatment.
2. You cannot estimate effects for Treat = 5, because you see them for only 1 period.
3. For the ones treated in period 4, you should be able to get something. At the very least you observe them at T=0, but not at any point after that. You should also be able to see something for any periods before.

Now, I suspect that there is something going on with your other explanatory variables. Perhaps they are missing, or stored as strings?
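
A minimal sketch of those points, by analogy with the coding in #8 and under stated assumptions: the data are simulated placeholders, y and x1 are hypothetical names for the outcome and one control, gvar is a hypothetical name for the cohort variable, and all women are placed in a single cohort first treated in bimester 4. female is deliberately left out of the covariate list.

```stata
* Toy data standing in for the real file (all values simulated).
clear
set seed 2023
set obs 1000
generate byte   bm     = 1 + mod(_n, 5)            // bimester 1..5
generate byte   female = runiform() < 0.5
generate double x1     = rnormal()                 // hypothetical control
generate double y      = rnormal()                 // hypothetical outcome

* One cohort: all women first treated in bimester 4; men never treated.
generate byte gvar = cond(female == 1, 4, 0)
tabulate bm gvar

* Controls should be numeric and free of missing values.
confirm numeric variable x1
misstable summarize x1

* Requires the community-contributed csdid and drdid (ssc install).
csdid y x1, time(bm) gvar(gvar)
```

If confirm numeric variable fails or misstable summarize reports many missing values on the real controls, that would be consistent with the suspicion above about the explanatory variables.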
manon ritto

Join Date: Mar 2024

Posts: 3
#15

29 Mar 2024, 04:02

Hello,
I am facing a similar but still different issue.
I also have repeated cross-section data. I am trying to analyse the impact of the use of ICT for voting on trust of people in their government.

I consider that a country, and thus all individuals in it, are treated if the country uses e-voting.
We have data from 1981 to 2023.
I created a variable treatment that takes the value 0 if a country was never treated; if a country was treated, it takes the value 1 only from the first year of implementation onwards.
So, for example, Bangladesh has treatment==0 for all years before 2018 and treatment==1 for all years >=2018, as it was treated in 2018.

For gvar I use the variable start_year, which takes as its value the first year in which e-voting was implemented. It takes the value 0 if the country was never treated.

When I do: csdid confidence_govt, cluster(Country) time(year_of_survey) gvar(start_year) method(dripw)

(year_of_survey is our time variable, it corresponds to the time when individuals have been interviewed)

It takes forever and does not give the expected results.
Is there something that I am missing?

Thank you very much
