  • Trouble implementing csdid with repeated cross-section data and individual-level treatment

    Hello Statalist,

    I am doing an evaluation of a policy in the United States, by studying its effect on voting behavior at the individual level. My outcome variable is thus an indicator that takes the value of 1 if that individual voted in that year's election, 0 otherwise.

    My data are a series of repeated cross-sections, spaced every two years between 2004 and 2020. The policy started in 2012, and a complicating factor is that treatment is defined at the individual level. Individuals are considered "treated" (I am working towards ITT estimates) if they meet certain income criteria.

    My goal is to implement the semiparametric DiD estimator proposed by Abadie (2005, Review of Economic Studies), applied to repeated cross-section data and generalized to several periods. For this purpose, I turn to the - csdid - package.

    I am able to implement a 2x2 design using - drdid - with the following command (post takes the value of 1 if year >= 2012):

    Code:
    drdid votvar female black age agesq married yrseduc nchild state_unempld [weight = weightvar], time(post) tr(treated) all cluster(stateid)
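
    For reference, a minimal sketch of how the post indicator described above could be built. The if qualifier is an added safeguard: Stata treats missing values as larger than any number, so a missing year would otherwise be coded as post = 1.

    ```stata
    * Sketch only: post = 1 from the 2012 election onward, 0 before
    capture drop post
    generate byte post = (year >= 2012) if !missing(year)
    * Check that every survey year falls on the intended side of the cutoff
    tabulate year post, missing
    ```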
    To generalize this to several periods, I turned to the csdid command, but I am having trouble with the gvar argument. I was initially using the following command:

    Code:
    csdid votvar female black age agesq married yrseduc nchild state_unempld [weight = weightvar], time(year) gvar(first_treat) ipw cluster(stateid)
    but I honestly am not sure what kind of variable I should indicate in the gvar argument. I first defined first_treat as a variable that takes the value of the year if an individual is treated in that year, but I think this is wrong.

    I may not even be implementing this specification correctly. I suspect the individual-level treatment indicator is problematic in this context, but I am somewhat inexperienced with DiD on repeated cross-sections, so I am not sure.

    Am I even correct in attempting this? If so, how should I define the gvar variable? I have read through csdid's documentation and I know it is meant to indicate the period in which an observation is first treated, but given that I am using repeated cross-sections, I am not sure how this is relevant because each individual is only observed during one period.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input int year byte(post votvar) float treated byte stateid double weightvar byte(female black age) int agesq byte(married yrseduc nchild) float state_unempld
    2004 0 1 1 1  2623.443 1 0 38 1444 1 14 2 .068462305
    2004 0 . 1 1 2722.3819 1 0 32 1024 1 12 2 .068462305
    2004 0 1 1 1 2429.3012 0 0 36 1296 1 16 1 .068462305
    2004 0 1 1 1 2749.4369 0 0 41 1681 1 16 2 .068462305
    2004 0 1 1 1  2623.443 1 0 35 1225 1 16 1 .068462305
    2004 0 1 0 1  2469.438 1 0 43 1849 1 16 1 .068462305
    2004 0 1 0 1 2661.5829 0 0 51 2601 1 13 1 .068462305
    2006 0 0 1 1  2324.241 1 0 33 1089 1 12 2 .064763226
    2006 0 1 1 1 3084.0646 1 0 40 1600 1 13 2 .064763226
    2006 0 1 1 1 3176.7826 1 0 45 2025 1 16 2 .064763226
    2006 0 1 1 1 2802.3992 0 1 38 1444 1 16 2 .064763226
    2006 0 0 1 1 2577.4423 1 0 25  625 1 12 1 .064763226
    2006 0 1 1 1 3593.7706 0 0 48 2304 1 16 2 .064763226
    2006 0 0 1 1 2661.4183 0 0 34 1156 1 12 1 .064763226
    2006 0 1 1 1 2480.7861 1 1 38 1444 1 13 2 .064763226
    2006 0 . 1 1 3176.7826 1 0 47 2209 0 12 1 .064763226
    2006 0 1 0 1 2410.8311 1 0 35 1225 0 12 1 .064763226
    2006 0 0 1 1 2428.1492 0 0 34 1156 1 13 2 .064763226
    2008 0 1 1 1 3323.5928 0 0 61 3721 1 16 2 .068048485
    2008 0 0 0 1 3650.3775 0 0 49 2401 1 13 1 .068048485
    2008 0 1 1 1 2989.0131 1 0 47 2209 1 17 2 .068048485
    2008 0 1 0 1 2989.0131 1 0 46 2116 1 12 1 .068048485
    2008 0 1 1 1 3671.8115 0 0 39 1521 1 14 1 .068048485
    2008 0 1 1 1 3892.0033 1 0 37 1369 1 16 2 .068048485
    2008 0 1 1 1  2782.015 0 0 35 1225 1 13 2 .068048485
    2008 0 0 1 1 3273.7527 0 0 41 1681 1 13 2 .068048485
    2008 0 1 1 1 3156.2368 1 0 31  961 1 16 1 .068048485
    2008 0 . 1 1 3026.6301 1 0 25  625 1 12 2 .068048485
    2008 0 . 1 1 3754.9383 0 0 26  676 1 10 2 .068048485
    2008 0 1 1 1 2853.1402 1 0 39 1521 1 16 2 .068048485
    2010 0 . 0 1 2637.3475 0 0 45 2025 1 17 1  .11254114
    2010 0 1 0 1 2925.8712 1 0 45 2025 1 13 1  .11254114
    2010 0 . 0 1  2581.067 1 0 49 2401 1 16 1  .11254114
    2010 0 . 1 1 2813.3854 1 0 37 1369 0  5 3  .11254114
    2010 0 0 0 1 3418.3679 0 0 49 2401 1 17 1  .11254114
    2010 0 . 0 1 2577.1412 1 0 42 1764 1 13 1  .11254114
    2010 0 . 0 1 2552.9962 1 0 34 1156 1 13 1  .11254114
    2010 0 . 0 1 3418.3679 0 0 47 2209 1 12 1  .11254114
    2010 0 . 0 1 2116.2538 0 0 45 2025 1 11 1  .11254114
    2012 1 1 0 1 3519.0209 0 0 46 2116 1 17 2  .09638353
    2012 1 1 0 1 2817.1964 1 0 42 1764 1 16 2  .09638353
    2012 1 0 0 1 3143.4283 0 0 41 1681 0 12 2  .09638353
    2012 1 0 0 1 3206.4962 0 0 43 1849 0 13 2  .09638353
    2012 1 1 0 1 3584.0283 0 0 42 1764 1 12 1  .09638353
    2012 1 1 1 1 3677.8484 1 0 34 1156 1 12 2  .09638353
    2012 1 0 0 1 2865.6223 1 0 36 1296 0 12 2  .09638353
    2012 1 1 0 1 2851.6203 1 0 34 1156 1 16 2  .09638353
    2012 1 1 0 1 3626.2316 1 0 47 2209 1 13 1  .09638353
    2012 1 1 0 1 2827.5482 0 0 41 1681 1 16 1  .09638353
    2012 1 1 0 1 3269.7352 0 0 35 1225 1 16 2  .09638353
    2012 1 1 0 1 2918.4123 1 0 47 2209 1 13 2  .09638353
    2012 1 1 1 1 3324.0039 0 0 36 1296 1 13 2  .09638353
    2012 1 0 0 1 3519.0209 0 0 45 2025 0 12 2  .09638353
    2012 1 1 0 1  3833.862 0 0 49 2401 1 17 2  .09638353
    2012 1 1 0 1  3447.875 1 0 28  784 1 13 1  .09638353
    2012 1 0 0 1 2725.3815 1 0 51 2601 1 17 3  .09638353
    2014 1 . 1 1 2188.7591 1 0 51 2601 1 13 2  .08329498
    2014 1 0 0 1   1492.99 0 0 60 3600 1 20 1  .08329498
    2014 1 . 1 1 3370.5631 0 0 55 3025 1 13 2  .08329498
    2014 1 1 0 1 1670.9967 0 0 52 2704 1 17 1  .08329498
    2014 1 0 0 1 1831.7712 0 1 39 1521 1 16 1  .08329498
    2014 1 0 0 1 1532.8349 1 0 45 2025 1 16 1  .08329498
    2014 1 1 1 1 1882.3686 1 1 36 1296 0 10 3  .08329498
    2014 1 1 0 1 2316.1321 0 0 34 1156 0 13 1  .08329498
    2014 1 0 1 1 1832.7644 1 0 46 2116 0 16 2  .08329498
    2014 1 . 0 1 1412.5225 1 0 37 1369 1 13 2  .08329498
    2014 1 1 0 1 1861.0222 1 0 42 1764 1 16 2  .08329498
    2014 1 0 0 1 1417.4637 0 0 40 1600 1 13 2  .08329498
    2014 1 1 0 1 2060.7882 0 0 44 1936 1 17 2  .08329498
    2014 1 1 0 1 1752.6919 1 0 52 2704 1  9 1  .08329498
    2014 1 0 0 1 2438.8565 0 0 53 2809 1 13 2  .08329498
    2014 1 1 0 1 1699.9555 1 1 37 1369 1 16 1  .08329498
    2016 1 1 1 1 1667.8133 0 0 43 1849 1 12 2  .06175678
    2016 1 1 0 1 1447.7026 1 0 52 2704 0 16 1  .06175678
    2016 1 1 1 1 1320.0783 1 0 36 1296 1 17 2  .06175678
    2016 1 1 0 1 1959.0251 0 0 31  961 0 16 1  .06175678
    2016 1 1 0 1 2050.7085 1 0 32 1024 1 13 1  .06175678
    2016 1 0 0 1 1623.5081 1 0 45 2025 0 16 2  .06175678
    2016 1 1 1 1  2284.904 1 0 38 1444 1 13 2  .06175678
    2016 1 0 0 1 1373.7601 1 0 33 1089 1 12 2  .06175678
    2016 1 1 1 1 1334.1652 1 0 44 1936 0 16 1  .06175678
    2016 1 1 1 1 1590.1385 0 0 39 1521 1 17 2  .06175678
    2016 1 1 0 1  1603.083 0 0 35 1225 1 12 2  .06175678
    2016 1 1 0 1 1690.5178 0 0 34 1156 1 13 1  .06175678
    2016 1 0 0 1 1320.0783 1 0 37 1369 0 11 1  .06175678
    2018 1 0 0 1 1889.1662 0 0 46 2116 1 12 2  .05406519
    2018 1 1 0 1 1824.2471 1 0 30  900 1 14 2  .05406519
    2018 1 1 0 1 1682.9579 0 0 40 1600 1 16 2  .05406519
    2018 1 1 0 1  1677.085 0 0 32 1024 1 13 2  .05406519
    2018 1 1 0 1 1889.1662 0 0 49 2401 1 13 2  .05406519
    2018 1 1 1 1 1656.8716 0 0 58 3364 1 17 2  .05406519
    2018 1 1 0 1 1828.8339 1 0 44 1936 1 16 2  .05406519
    2018 1 . 0 1 1910.1095 1 0 29  841 1 16 2  .05406519
    2018 1 1 1 1 1978.0334 1 0 47 2209 1 16 2  .05406519
    2018 1 0 0 1 1732.7991 1 0 35 1225 1 16 1  .05406519
    2018 1 0 0 1 1600.3391 0 0 45 2025 1 14 1  .05406519
    2018 1 0 1 1 1716.7169 1 0 39 1521 0 12 3  .05406519
    2018 1 0 0 1 1657.2191 1 0 45 2025 1 13 2  .05406519
    2018 1 . 0 1 1543.8662 0 0 31  961 1 16 2  .05406519
    2018 1 1 0 1 1896.5009 1 0 45 2025 1 16 2  .05406519
    end
    label values age AGE
    label def AGE 25 "25", modify
    label def AGE 26 "26", modify
    label def AGE 28 "28", modify
    label def AGE 29 "29", modify
    label def AGE 30 "30", modify
    label def AGE 31 "31", modify
    label def AGE 32 "32", modify
    label def AGE 33 "33", modify
    label def AGE 34 "34", modify
    label def AGE 35 "35", modify
    label def AGE 36 "36", modify
    label def AGE 37 "37", modify
    label def AGE 38 "38", modify
    label def AGE 39 "39", modify
    label def AGE 40 "40", modify
    label def AGE 41 "41", modify
    label def AGE 42 "42", modify
    label def AGE 43 "43", modify
    label def AGE 44 "44", modify
    label def AGE 45 "45", modify
    label def AGE 46 "46", modify
    label def AGE 47 "47", modify
    label def AGE 48 "48", modify
    label def AGE 49 "49", modify
    label def AGE 51 "51", modify
    label def AGE 52 "52", modify
    label def AGE 53 "53", modify
    label def AGE 55 "55", modify
    label def AGE 58 "58", modify
    label def AGE 60 "60", modify
    label def AGE 61 "61", modify
    label values nchild NCHILD
    label def NCHILD 1 "1 child present", modify
    label def NCHILD 2 "2", modify
    label def NCHILD 3 "3", modify
    I am using Stata 17 SE on a Windows PC.

    Thank you.
    Last edited by Anxo Ferreiro; 30 Aug 2022, 08:04.

  • #2
    Hi Anxo
    The definition of gvar is correct, but it is less clear how to use it with repeated cross-section data: because you do not see the same individuals across time, you do not know what the right "timing" would be for that group.

    So, if I understand correctly, you can only see if an individual is treated or not (has enough income to be considered treated or not) but do not know "when" that individual met that income level.
    Unfortunately, if you do not have this piece of information, you cannot account for timing and cohort heterogeneity.

    Perhaps if you provide me with more information on the problem, I may be able to provide better feedback.
    Fernando

    • #3
      Hi Fernando, thanks so much for your reply.

      In my data, there is information on income for each year, so I build the treatment indicator based on that variable. So, for 2008, there is an income variable corresponding to that year with values for each individual. I then generate the indicator for treatment based on this variable. The same applies for 2010, 2012 and so on. This means each individual is treated or not at each year.

      Therefore, if I generate the gvar as I described it in my original post, for the 2010 year, all individuals that are treated (based on their 2010 income) will have a gvar value of 2010, and the untreated ones will have a value of 0. In 2012, all treated individuals will have a value of 2012 and so on. I think this follows the definition of gvar, but still does not solve the problem. This is the error that I receive when I specify gvar like this:

      Code:
      (importance weights assumed)
      No never treated observations found. Using Not yet treated data
      Units always treated found. These will be excluded
      All observations require at least 1 not treated period
      See cross table below, and verify All Gvar have at least 1 not treated period
      
                 |                 first_treat
            year |      2008       2012       2016       2020 |     Total
      -----------+--------------------------------------------+----------
            2008 |     3,725          0          0          0 |     3,725 
            2012 |         0      3,138          0          0 |     3,138 
            2016 |         0          0      2,507          0 |     2,507 
            2020 |         0          0          0      1,736 |     1,736 
      -----------+--------------------------------------------+----------
           Total |     3,725      3,138      2,507      1,736 |    11,106 
      --Break--
      r(1);
      (In the code above, I have used data spaced out every four years, hence the jumps from 2008 to 2012, but the same would apply if I was using data spaced out every two years).
      Let me know if this information is helpful.

      Thanks so much for your help.

      • #4
        Hi Anxo
        There is no problem with the time gaps. It's more about the gvar construction.
        For example, in the year 2008, you can identify individuals treated in that year. And you can probably also identify individuals that were never treated (controls).
        However, you should still be able to identify groups of individuals who would be potentially treated in all other years as well: 2012, 2016, 2020.

        In other words, consider only year 2008. Can you identify in this sample who would be potentially treated in 2012, 2016 or 2020?
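
        A quick way to run this check, sketched with the variable names from #1 and #3 (first_treat is the candidate gvar): tabulate the cohorts within a single cross-section. A usable gvar shows several cohort values inside each survey year; if each year contains only its own year (plus 0 for controls), csdid has no pre-treatment observations for any cohort.

        ```stata
        * Within the 2008 cross-section alone, which cohorts are represented?
        tabulate first_treat if year == 2008, missing
        * Full cross-tab: a usable gvar fills cells off the diagonal as well
        tabulate year first_treat, missing
        ```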

        • #5
          Hi Fernando,

          Unfortunately not. Within each year's cross section I can only identify if individuals are treated that year, but not in earlier or later years.

          I suppose this renders my identification strategy useless.

          Would the - drdid - specification still be valid?

          • #6
            Only if the treated and control groups are correctly identified.
            Meaning, you are assuming that income doesn't change across time, so if a unit is "treated", it was always treated.
            Now, I'm more curious about the definition of pre and post. How is that identified?

            • #7
              Pre and post are defined as before or after 2012. The program I am evaluating began in 2012. Income-eligible individuals on or after this period would be treated. Income-ineligible individuals would not be treated. Before 2012, neither of the groups would receive treatment.

              Thanks so much for your help.

              • #8
                Ohhh, OK, that makes much more sense now.
                In that case, set gvar=0 if they are not income eligible, and gvar=2012 if they are income eligible.
                Then you can run csdid or drdid.
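
                A minimal sketch of this recoding, reusing the variable names from #1 (treated is assumed to be the 0/1 income-eligibility flag, observed in every survey year). The csdid call is copied unchanged from #1 apart from the recoded gvar.

                ```stata
                * gvar = 2012 for income-eligible individuals in every survey year, 0 otherwise
                capture drop first_treat
                generate int first_treat = cond(treated == 1, 2012, 0) if !missing(treated)
                * Each year should now contain both the 0 group and the 2012 cohort
                tabulate year first_treat, missing
                * Same specification as in #1, with the recoded gvar
                csdid votvar female black age agesq married yrseduc nchild state_unempld [weight = weightvar], time(year) gvar(first_treat) ipw cluster(stateid)
                * Aggregate the group-time effects into an event study
                estat event
                ```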

                • #9
                  Hi Fernando,

                  Thanks so much for that. It worked. Just to make sure: does this implementation (the csdid one), where I specify gvar = 2012 for income-eligible individuals in years earlier than 2012, still imply the assumption that, if an individual is treated in, say, 2008, their treatment status would be the same in the following years?

                  • #10
                    Based on your description, treatment only started in 2012. So it doesn't matter when the data is collected: your assumption is that income-eligible individuals were treated in 2012 (if the data is collected after 2012) or will be treated in 2012 (if the data is collected before 2012).

                    • #11
                      Hi guys, I have a somewhat similar problem to Anxo's. Would you help me out?

                      I have 6 repeated cross sections of different men and women, one cross section for each bimester in one year. The program that I am evaluating starts in the 4th bimester and affects only women. According to what I understood, I should define gvar=4 for women in the 1st, 2nd, 3rd and 4th bimesters, and then gvar=5 and gvar=6 for women in the 5th and 6th bimesters, respectively. But I get nothing. I would really appreciate the help!


                      By the way, I also tried defining treat as 4 for every woman in every bimester; that didn't work either.
                      [Attached image: 56f3c0ab-8d82-4e3a-8a23-2efb2a294b53.PNG]

                      Last edited by Berenice Hernandez; 30 Jan 2023, 18:43.

                      • #12
                        First troubleshooting step: can you run
                        tab bm treat5
                        Also, I suspect that you can’t use bank ID dummies (too many dummies).

                        • #13
                          Sure. Just a small disclaimer: I have only 5 bimesters, not 6 (sorry); this follows from the way I defined them. The tab looks like this:
                          [Attached image: tab1.png]

                          It may also help to see that only observations where female==1 are assigned the values 4 and 5 for the treat5 variable.

                          [Attached image: tab2.png]


                          As for bank_id, I have 14 different banks, meaning 13 dummies, but my data set consists of 3.5 million observations. Any thoughts or rules of thumb on adding controls are welcome.

                          Thank you for your help!

                          • #14
                            OK, so 3 points:
                            1. You cannot use gender as a control, because it defines treatment.
                            2. You cannot estimate effects for Treat=5, because you see them for only 1 period.
                            3. For the ones treated in period 4, you should be able to get something: at the very least you observe them at T=0, though not at any point after that. You should also be able to see something for any periods before.

                            Now, I suspect that there is something going on with your other explanatory variables. Perhaps they are missing, or stored as strings?
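
                            A sketch of how these points might translate into code, following the logic of #8 (a single cohort for everyone who is ever eligible). The outcome name y and the new variable gvar4 are hypothetical; bm, female and bank_id are the names used in the posts above.

                            ```stata
                            * Every woman belongs to the cohort first treated in bimester 4; men are never treated
                            generate byte gvar4 = cond(female == 1, 4, 0) if !missing(female)
                            tabulate bm gvar4, missing
                            * Check storage types (string variables cannot enter as controls) and missing outcomes
                            describe y bank_id
                            count if missing(y)
                            * Start without covariates, and without gender, which defines treatment
                            csdid y, time(bm) gvar(gvar4)
                            estat event
                            ```

                            If this baseline runs, controls can be added back one at a time to find the one that breaks the estimation.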

                            • #15
                              Hello,
                              I am facing a similar but still different issue.
                              I also have repeated cross-section data. I am trying to analyse the impact of the use of ICT for voting on people's trust in their government.

                              I consider a country, and thus all individuals in it, to be treated if the country uses e-voting.
                              We have data from 1981 to 2023.
                              I created a variable treatment that takes the value 0 if a country was never treated; if a country was treated, it takes the value 1 only from the first year of implementation onward.
                              So, for example, Bangladesh has treatment==0 for all years before 2018 and treatment==1 for all years >=2018, as it was treated in 2018.

                              For the gvar I use the variable start_year, which takes as its value the first year when e-voting was implemented. It takes 0 if the country was never treated.

                              When I do: csdid confidence_govt, cluster(Country) time(year_of_survey) gvar(start_year) method(dripw)

                              (year_of_survey is our time variable, it corresponds to the time when individuals have been interviewed)

                              It takes forever and does not give the expected results.
                              Is there something that I am missing?

                              Thank you very much
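
                              A first diagnostic in the spirit of #12, using the variable names from this post: csdid estimates a separate effect for every cohort-by-period pair, so with surveys spanning 1981 to 2023 the number of cohorts and periods largely determines how much work it has to do.

                              ```stata
                              * How many adoption cohorts are there, and how large is each?
                              tabulate start_year, missing
                              * Which cohorts are observed in which survey years?
                              tabulate year_of_survey start_year, missing
                              ```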
