
  • Difference in difference with multiple treatments

    Dear Statalist Users,

    I am trying to measure the impact of Uber on the earnings of taxi drivers and would greatly appreciate advice on how best to tackle this in Stata 16.1.

    I have the following data:
    1) Pooled cross-sectional individual-level data on taxi drivers in 5 major US cities over 10 years: earnings (dependent variable lincearn) and various individual and city-level characteristics (age, gender, citizenship, unemployment)
    2) Data on when Uber was introduced in each specific city: a dummy variable uber, which takes the value 1 if Uber was present in the individual's city in that year and 0 otherwise

    I am looking to measure the impact of Uber on their earnings while controlling for individual and city-level characteristics. I have looked into using xtset/xtreg, which says there are too many time values within panel.

    I would greatly appreciate suggestions on how I could approach this in Stata.

    Below is a snapshot of my data, where met2013 identifies the city the individual is in and lincearn is the log of their earnings.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input int year long met2013 float(uber lincearn)
    2009 12420 0  10.12663
    2009 12420 0  9.680344
    2009 12420 0 10.308952
    2009 12420 0 10.308952
    2009 12420 0  9.740969
    2009 12420 0  9.423838
    2009 12420 0  9.775654
    2009 12420 0 10.714417
    2009 12420 0 10.596635
    2009 12420 0 10.819778
    2009 12420 0  9.998797
    2009 12420 0 9.2103405
    2009 12420 0  9.798127
    2009 12420 0  9.425451
    2009 12420 0         .
    2009 12420 0  9.723164
    2009 12420 0  10.12663
    2009 12420 0 10.819778
    2009 12420 0 10.341743
    2009 12420 0   11.0021
    2009 12420 0  10.08581
    2009 12420 0 10.645425
    2009 12420 0         .
    2009 12420 0  9.287301
    2009 12420 0 10.463103
    2009 12420 0  9.159047
    2009 12420 0 10.308952
    2009 12420 0 10.488493
    2009 12420 0 12.538967
    2009 12420 0         .
    2009 12420 0  8.987197
    2009 12420 0   6.39693
    2009 12420 0  10.37349
    2009 12420 0         .
    2009 12420 0  9.903487
    2009 12420 0 10.691945
    2009 12420 0  5.703783
    2009 12420 0         .
    2009 12420 0 9.2103405
    2009 12420 0         .
    2009 12420 0  8.853665
    2009 12420 0  10.18112
    2009 12420 0  9.746834
    2009 12420 0 9.1049795
    2009 12420 0 10.060492
    2009 12420 0  10.37349
    2009 12420 0  8.294049
    2009 12420 0   9.92818
    2009 12420 0  9.798127
    2009 12420 0  7.901007
    2009 12420 0         .
    2009 12420 0  9.392662
    2009 12420 0 11.141862
    2009 12420 0  9.169518
    2009 12420 0 10.714417
    2009 12420 0   11.0021
    2009 12420 0 10.491274
    2009 12420 0 10.691945
    2009 12420 0  8.961879
    2009 12420 0  10.12663
    2009 12420 0   6.55108
    2009 12420 0 10.819778
    2009 12420 0         .
    2009 12420 0 10.645425
    2009 12420 0         .
    2009 12420 0 10.308952
    2009 12420 0 10.714417
    2009 12420 0  8.853665
    2009 12420 0         .
    2009 12420 0 10.596635
    2009 12420 0 10.308952
    2009 12420 0  8.764053
    2009 12420 0 10.518673
    2009 12420 0  9.539644
    2009 12420 0 10.203592
    2009 12420 0  8.294049
    2009 12420 0 10.437053
    2009 12420 0 10.645425
    2009 12420 0   6.55108
    2009 12420 0 10.778956
    2009 12420 0  9.903487
    2009 12420 0 10.485703
    2009 12420 0  9.392662
    2009 12420 0 10.819778
    2009 12420 0         .
    2009 12420 0 10.714417
    2009 12420 0   9.11603
    2009 12420 0         .
    2009 12420 0         .
    2009 12420 0  8.699514
    2009 12420 0 10.404263
    2009 12420 0  10.23996
    2009 12420 0   11.0021
    2009 12420 0 10.645425
    2009 12420 0  8.881836
    2009 12420 0  8.517193
    2009 12420 0  9.615806
    2009 12420 0  9.305651
    2009 12420 0  10.16969
    2009 12420 0  9.305651
    end
    label values year year_lbl
    label def year_lbl 2009 "2009", modify
    label values met2013 met2013_lbl
    label def met2013_lbl 12420 "Austin-Round Rock, TX", modify
    Kind regards,
    Aayush Bakshi

  • #2
    Let me make sure I'm understanding you: you have 5 treated cities in your analysis, yes? How many cities are there in total?

    EDIT: You write that you
    have looked into using xtset/xtreg, which says there are too many time values within panel.
    I don't believe for a moment that Stata literally said this. What did Stata really tell you?
    Last edited by Jared Greathouse; 29 Mar 2022, 19:52.



    • #3
      Dear Jared,

      Thanks for your reply.

      I have a total of 5 cities in my analysis: some were treated earlier and some later. For example, some cities had Uber in 2011 while others had it in 2015.

      Apologies, let me clarify the xtset/xtreg issue. The code I used was:
      Code:
      xtset met2013 year
      where met2013 identifies the city each individual is in.

      I received the error "repeated time values within panel" from this command.

      I am not too sure how to approach this. I could use aggregate values for each city, but I would lose the ability to control for individual-level characteristics, so this is not ideal. Or would a staggered DiD be a viable strategy?

      Thanks,
      Aayush



      • #4
        Okay, so I have a few thoughts about this. Firstly, you have a very small number of treated units. In fact, all of your units are treated, so you cannot construct the counterfactual, since you have no units to compare your treated units to in the post-policy period. In stats terms, you never observe the untreated potential outcomes: you observe all units under treatment and have nothing to compare them to.

        Assuming you had 20 untreated cities along with your 5 treated ones, imbalances in treatment timing can be addressed by commands written by Rios, Chaisemartin and others, and Xu and others.
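
        For reference, a minimal sketch of installing the community-contributed estimators alluded to here (both are SSC packages; csdid is Rios-Avila's implementation of Callaway and Sant'Anna, and did_multiplegt is de Chaisemartin and D'Haultfoeuille's):
        Code:
        * Staggered-adoption DiD estimators mentioned above
        ssc install csdid            // Callaway & Sant'Anna, via Rios-Avila
        ssc install did_multiplegt   // de Chaisemartin & D'Haultfoeuille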

        Now for the outcome: maybe I'm wrong, or you can give the econometric theory better than me, but I'm not sure why we'd need individual-level data here. After all, the policy isn't being applied directly to the individual; it's being applied to the city the individual lives or works in. I suppose we COULD use individual-level data, but if this were my problem (which it sort of is, actually; I'm doing an Uber analysis right now), I would simply aggregate this to the city level, which brings me full circle to the moral of the story:

        You need more data. Five cities that are all eventually treated wouldn't convince me of parallel trends or similarity on common factors (certainly not at the individual level), and the design also won't work because you need a group of units that was never exposed to the intervention.


        Oh, the reason you're getting repeated time values is that your data are at the individual level. If you and I are both in Atlanta and we xtset on city, the Atlanta ID appears more than once in the same year because we both live there, so the city-year pair does not uniquely identify observations. That's another reason to do your work at the city level, in my opinion, unless you just want to create an ID for the individual, as sketched below.
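
        For illustration, a minimal sketch of both workarounds, using the variable names from the dataex excerpt above:
        Code:
        * Workaround 1: give each observation its own ID (pooled cross
        * sections, so individuals are not followed over time)
        gen long id = _n

        * Workaround 2: collapse to a city-year panel, which xtset accepts
        collapse (mean) lincearn (max) uber, by(met2013 year)
        xtset met2013 year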



        • #5
          Thanks for your great advice; I have taken a lot from your comments and built on them:

          1) Most importantly, I have included untreated cities in my analysis which never had Uber within the time frame.

          But I have two options from this analysis (taken from https://www.statalist.org/forums/for...fference):

          First alternative (at the individual level)

          Y_ist = α + β T_st + a_s + θ_t + ε_ist
          i – individual, s – state, t – year
          T_st – whether state s had the treatment by year t


          Second alternative (at the state level)

          Y_st = α + β T_st + a_s + θ_t + ε_st

          s – state, t – year

          Y_st is the average of the dependent variable for all individuals in state s at time t.
          T_st – whether state s had the treatment by year t
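
          For concreteness, a minimal sketch of the two alternatives in Stata, with uber playing the role of T and met2013 the role of s:
          Code:
          * First alternative: individual-level data with city and year fixed effects
          reg lincearn i.uber i.met2013 i.year, vce(cluster met2013)

          * Second alternative: collapse to city-year means first, then estimate
          collapse (mean) lincearn (max) uber, by(met2013 year)
          xtset met2013 year
          xtreg lincearn i.uber i.year, fe vce(cluster met2013)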

          Ideally, given the nature of the paper and since I want to look at the impact of individual characteristics, I would like to run it at the individual level rather than at the state level. Furthermore, I have concerns about standard errors, test statistics, and confidence intervals with state-level data.

          How would I run this analysis in Stata? What would you recommend?



          • #6
            So, how many untreated units are there? The helpfiles for the commands I've cited will guide you on how to implement them. I'd likely use synthetic controls or some newer version of the difference-in-differences commands I mentioned. What concerns you about your SEs, t-stats, and CIs with state data?



            • #7
              This looks like a staggered intervention with pooled cross sections. You probably want to allow some heterogeneity in the treatment effects by treatment cohort and calendar time; presumably the initial effect of Uber was smaller than the effects later. This is easy to do by defining cohort dummies for the different initial treatment periods. Then, these get interacted with time dummies in the post-treatment periods.
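
              A minimal sketch of that interaction structure, assuming a variable firstyr (the first year Uber operated in the city, 0 for never-treated cities) that is not in the posted excerpt:
              Code:
              * firstyr defines the treatment cohorts (hypothetical variable).
              * uber = 1 exactly in post-treatment city-years, so interacting the
              * cohort and year dummies with the uber == 1 indicator yields
              * cohort-by-calendar-time effects in the post-treatment periods only
              reg lincearn i.firstyr i.year i.firstyr#i.year#1.uber, vce(cluster met2013)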

              A difficult question is computing standard errors. If you condition on the treatment assignment, then you can use heteroskedasticity-robust standard errors. But if you want to account for the uncertainty in the "policy" assignment, you should cluster. The problem is, clustering with few treated cities might not work well.
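
              In code, the two choices look like this (a sketch with city and year fixed effects; met2013 is the city identifier from the excerpt):
              Code:
              * Conditioning on the treatment assignment: robust standard errors
              reg lincearn uber i.met2013 i.year, vce(robust)

              * Accounting for uncertainty in the policy assignment: cluster by city
              * (as noted above, this may work poorly with few treated cities)
              reg lincearn uber i.met2013 i.year, vce(cluster met2013)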

              You can collapse the data to city-level panel data but, again, the clustering issue can be problematic. It's worth a try, though.

              Here's a link to a paper of mine, along with Stata files, that discusses the panel data case. The issues with pooled cross sections are similar.


              https://www.dropbox.com/sh/zj91darud...bgsnxS6Za?dl=0



              • #8
                Then, these get interacted with time dummies in the post-treatment periods.
                Jeff Wooldridge, is this essentially what's going on under the hood with the newer DD estimators, like this one?



                • #9
                  You need more data. Five cities that are all eventually treated wouldn't convince me of parallel trends or similarity on common factors (certainly not at the individual level), and the design also won't work because you need a group of units that was never exposed to the intervention.
                  Actually, this design, with a treatment being introduced sequentially into different cohorts, is increasingly used in epidemiology, and particularly in clinical-translational research. It is called the stepped-wedge design. And even though all participants are ultimately treated, the design can be thought of as defining a sequence of eras. The first era is before the first group gets treated. The second era begins when the first group gets treated. The third era begins when the second group gets treated, etc. The final era is when all groups have been treated.

                  Within each era other than the first and last, you have a synchronic comparison between treated and untreated groups, and you also have, within each group, the within-group comparison of pre- and post-treatment outcomes. These can be combined to provide an estimate of the treatment effect. The analysis has to include time indicators and group indicators. This gives at least partial adjustment for secular trends and between-group baseline differences.

                  And usually when we do this in clinical-translational research, we randomize the order in which the groups begin treatment. Evidently in this context randomization is not possible. The question posed here also has a wrinkle in that we usually assume that treatment effects are constant over time, whereas here it is specifically assumed to be otherwise. But, as Jeff Wooldridge has pointed out, this can be handled with some extra interaction terms.
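
                  In this thread's variables, the basic analysis described here would be sketched as:
                  Code:
                  * Time and group indicators plus the treatment dummy combine the
                  * between-group and within-group comparisons described above
                  reg lincearn i.year i.met2013 i.uber, vce(cluster met2013)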



                  • #10
                    If I understand you well, you're essentially saying that (let's say with 3 total units) we have a pre-period for all units (no treated units), era 1 where unit 1 is treated (compared to the other two units), era 2 where unit 2 is compared to the other, still untreated unit, and era 3 where all our units are now treated. Clyde Schechter is that about right?

                    I've actually advocated (well, I didn't invent it, I adapted it) a similar approach for synthetic controls, in concert with recent advances in difference-in-differences, as is the case here from what I can tell. I'll clarify what I meant by my first comment above: I agree that it's mechanically possible to estimate this. EDIT: My main motivation for suggesting more data is that having more panels makes it more likely that we have units which are suitable comparison groups. If we have an intervention in Atlanta, sure, I can compare Atlanta to Miami, Charlotte, and Boston, but wouldn't it (usually) be better to have a more representative sample of other units? Of course this isn't always possible, but I figured that with Uber this might be useful.
                    Last edited by Jared Greathouse; 30 Mar 2022, 21:42.



                    • #11
                      Yes, that's about right. Actually, in era two, units 1 and 2, which are treated, are compared to unit 3, which is still untreated. But otherwise, yes, that's how it works. In each era except the first and last, there is a comparison of all treated units with all remaining untreated units. And there is also the within-unit comparison of pre- and post-treatment.

                      And I agree that more units make for a better design. In the typical clinical-translational application, though, the units have many participants each, but units, being typically medium-to-large groups, are difficult to recruit. So you usually end up with a modest number of units.



                      • #12
                        Jeff Wooldridge Thanks for your reply,

                        I have read through your paper as well as watched your helpful seminar. This is precisely what I am aiming to do with my study. I have a few follow-up questions with regard to the Stata code:

                        1) Since I will be using pooled cross-sectional data, this will not be two-way fixed effects, is that correct? I will be using something along the lines of:

                        Code:
                        * Adapted from the paper's example; as I understand it, d2-d4 are
                        * treatment-cohort dummies, f02-f04 post-treatment time dummies,
                        * and x is a covariate
                        reg lincearn uber i.year x d2 d3 d4 c.d2#c.x c.d3#c.x c.d4#c.x ///
                            c.f02#c.x c.f03#c.x c.f04#c.x, vce(cluster id)
                        This is just example code adapted from your paper, but I will be using reg instead of xtreg and also adding interaction terms between cohort dummies and post-treatment time dummies. The variable uber will take the value one if the treatment was applied in that city during that year, and its coefficient will be the coefficient of interest. Am I correct in saying this?

                        2) In terms of computing standard errors, would you recommend clustering if I were to add more treated/untreated cities?
                        3) Would this method be possible if all cities were eventually treated within the time period? Or would I require cities that were never treated?



                        • #13
                          I don't understand the one-way interaction terms. Why did you use them? I'm not asking rhetorically, by the way, since I've only ever seen two-way interactions used in practice.

                          You have panel data. Pooled cross section, whatever you'd like to call it, you observe the same units over a time period; thus, this is the standard DD setup:
                          Code:
                          * Standard two-way fixed-effects DD; fe is needed because
                          * xtreg defaults to random effects
                          xtreg lincearn uber i.year, fe vce(cluster id)
                          I wouldn't use this setup myself due to staggered interventions as well as heterogeneous treatment effects... but this is the start. Regarding the SEs, the econometrics Gods have spoken on this subject, and the Gospel according to Abadie and co sayeth the following:
                          With fixed effects, one should cluster if either (i) both PC_n < 1 (clustering in the sampling) and there is heterogeneity in the treatment effects, or (ii) σ² > 0 (clustering in the assignment) and there is heterogeneity in the treatment effects
                          An econometrician may correct me, but it is certain you have heterogeneous effects, and it is likely the case that you have clustering in the assignment of the intervention.

                          Clyde and I spoke about all units eventually being treated. Given that I work in public policy, I've never encountered a situation where all units are eventually treated, but it's possible to do this in the DD/event-study framework. Aayush Bakshi



                          • #14
                            Jared Greathouse, apologies, I meant to say that I would use something along the lines of:
                            Code:
                            * uber#i.year: treatment-by-year interactions;
                            * uber#i.met2013#i.year: treatment-by-city-by-year interactions
                            * (met2013 is the city identifier from the data excerpt)
                            reg lincearn uber i.year uber#i.year uber#i.met2013#i.year, vce(cluster id)
                            to show the interaction term.

                            I am not sure if I am mistaken here, but since my data survey different individuals over the time period (and I want to use individual-level data to control for individual characteristics), I would not be able to use xtreg in my regression.

                            Essentially I want to run a regression with the following form (with the additional interaction terms of course):
                            lincearn_ist = α + β Uber_st + γ X_i + a_s + θ_t + ε_ist
                            i – individual, s – city, t – year
                            Uber_st – whether city s had Uber by year t
                            X_i – vector of individual characteristics


                            and this would be my setup for a staggered DiD. Would this be correct?
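
                            A hedged translation of that specification into Stata (age and female are placeholders for the X_i vector; they are not in the posted excerpt):
                            Code:
                            * Individual-level DiD with individual controls and city and
                            * year fixed effects (sketch; covariates are illustrative)
                            reg lincearn i.uber age i.female i.met2013 i.year, vce(cluster met2013)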

                            I appreciate the help and have learnt a lot about clustering from your message.



                            • #15
                              I still don't get the one-way interaction terms. Do you even know how to interpret them? Again, I certainly don't mean to sound mean or rhetorical; I genuinely don't know how to interpret those. You can still use individual covariates within the context of a DD regression if you want (unless they're time-invariant with a fixed-effects approach).

                              Trust me, any model with more than one interaction term is a nightmare to interpret, and you have one 1-way interaction and one 3-way interaction term. Other than that, I think the main thing to be concerned with is imbalances in event time and heterogeneous treatment effects. You likely need one of Stata's more advanced DD commands to handle these things, unless this is just a class paper, in which case your instructor likely won't know or care.

