  • Assigning treatment and control group in DID?

    Dear all,

    I am evaluating a change in health insurance using a difference-in-differences (DID) method. I have a question about the treatment-control assignment, as I am not sure whether I am heading in the right direction. My study context is as follows:
    1) I have annual cross-sectional data for five years: 2010, 2013, 2015, 2017 and 2020.
    2) In 2014, a change in health insurance policy entitled pensioners to 95% coverage of healthcare costs, meaning they pay only 5% of total costs when using health services. Before that, in 2010 and 2013, pensioners and others holding a health insurance card paid a co-payment rate of 20%, with the remaining 80% covered by health insurance.
    3) Here is what my research design looks like:
    3.1) Post-periods are from 2015 onwards and pre-periods are 2010 and 2013.
    3.2) Treatment and control definitions.
    Option 1: the treatment group would be individuals who were retired in 2015 or in later years, while the control group would be individuals who were retired in 2010 and 2013.
    Option 2: the treatment group is defined as in option 1, whereas the control group consists of individuals who were retired in 2010 and 2013 plus others who have a health insurance card.
    In both options, I restrict the sample to individuals who have an insurance card and exclude those who do not.

    Since this is the first time I have conducted a DID study, any advice or suggestion is highly appreciated. Thank you.

  • #2
    treated people would be individuals who were retired in 2015 or in later years, while the control group would be individuals who were retired in 2010
    I don't understand. What is the intervention or change you're interested in?


    • #3
      Originally posted by Jared Greathouse
      I don't understand. What is the intervention or change you're interested in?
      Hi Jared,

      I thought I had stated the objective of my study, but it turns out that I did not. Sorry about that. My research aims to examine the effect of a reduction in co-payment on health care utilization. As you can see in #1, the co-payment rate for pensioners was reduced from 20% to 5% in 2014. I therefore define the treatment and control groups based on eligibility for this reduction and the time the law was introduced. I hope this is now clear. What do you think about the treatment-control assignment in #1? Thanks.


      • #4
        The treatment indicator would just be =1 if they received the intervention and the year is later than 2014.

        However, this isn't the end of the story. You also must compare people on both sides of the cutoff. You mention "eligibility" criteria, and when I hear that word, fire alarms go off in my head which say "regression discontinuity".

        So your design must compare people who were both eligible for the treatment AND received the treatment, to similarly situated people who were ineligible for the treatment but were otherwise the same on background covariates. This is called a differences in discontinuities approach, which combines DD (before and after) with RD (greater or less than the cutoff). For example, say the age you gotta be is 60 to be eligible. You'd be comparing people who were treated to those who were untreated, but also very similar (say, ages 57-59).

        I can't give code until you give me a dataset to work with.


        • #5
          Hi Jared,

          Thank you for your quick reply. RD seems interesting and I will think about it. For now, let's discuss the DID method in my context. My DID can be written as follows:
          h = alpha*T + beta*P + gamma*T*P + error (I omit the constant for simplicity)
          where T is the treatment indicator, taking a value of 1 if treated and 0 otherwise; P is an indicator for policy exposure, equal to 1 in the post-period (i.e., 2015, 2017 and 2020) and 0 otherwise; gamma is the parameter of interest. What I would like advice on is the indicator T. In other words, who should be assigned to the treatment group and who to the control group? My thought is that the treatment group consists of individuals who were retired in 2015 or later (note that these people could have retired earlier, such as in 2013 or 2010). The control group would be individuals whose retirement status was observed in 2013 and 2010. That means I compare differences in healthcare use among pensioners, and the only difference between them is their exposure to the reform. I am not sure whether my thinking is correct, so I would be grateful if you could advise me on this.
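
          A minimal sketch of how this specification could be estimated in Stata, on simulated data. The variable names h, T and P follow the equation above; the sample size and the coefficients used to generate the data are arbitrary assumptions, not values from the actual survey:

          ```stata
          * Toy data only: all magnitudes below are illustrative assumptions
          clear
          set seed 12345
          set obs 2000
          gen byte T = runiform() < 0.5      // 1 = treatment group
          gen byte P = runiform() < 0.6      // 1 = post-period (2015, 2017, 2020)
          gen h = 1 + 0.5*T + 0.3*P + 0.8*T*P + rnormal()

          * i.T##i.P adds both main effects and the interaction;
          * the coefficient on 1.T#1.P is the DID estimate
          regress h i.T##i.P, vce(robust)
          ```

          Factor-variable notation keeps the interaction inside the model, so there is no need to generate the product term by hand.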

          P.S. I am still cleaning the data, so I am unable to post a data example at the moment, but I will do so once I have finished.

          Thanks.


          • #6
            If this were my problem, then it would be whoever received the co-payment reduction. The copayment reduction is the thing of interest, so whoever got it is a treated unit (=1); everyone else would just be 0.


            • #7
              I think this is a very thorny problem, and I don't see any really good solutions. The problems with using a straightforward DID approach where T = retired in 2015 are several:
              1. They will be younger on average than those who retired earlier, and age is a huge determinant of the health care utilization outcome. So there is this monster confounder, and there is a very high probability that the parallel trends assumption will be false.
              2. The decision to retire may well be predicated on ill-health, so there is this endogeneity.
              Nor can I think of any modified definition of T that gets around these problems satisfactorily.

              I don't think I would use a DID approach to this. Regression discontinuity might be helpful here, but given how frequently it is misused and abused in the literature, I tend to be skeptical of RD analyses, which has discouraged me from really learning it well.

              That said, this might be a situation for using an approach that I do not use in my own work*: instrumental variables. I'm thinking here of reaching the age of eligibility for retirement in 2015 as an instrument. As I do not use instrumental variables in my own work, and my knowledge of it is very limited, I can't give you specific advice on implementation. But it does strike me as a place where this might work.

              *My non-use of instrumental variables does not arise, unlike regression discontinuity, from any concerns about the validity of the technique or the ways in which it is commonly used. It is just that in epidemiology, it is so seldom the case that a suitable instrument can be found. I guess the one area of epidemiology where it is used fairly often is Mendelian randomization, but I don't do genetic epidemiology.


              • #8
                Dear Jared and Prof. Clyde,

                Thank you so much for your insightful input. I have thought about your advice, and you are right: a pure DID may not be appropriate here. An IV or RD method combined with DID may be worth trying. In my situation, not everyone who reaches the age of 60 (for men, for instance) will retire; some may continue working, so the probability of moving from work to retirement does not jump from zero to one. That means a fuzzy RD (akin to IV) could be appropriate, where I will use the official age of retirement to instrument for retirement status. Econometric equations for RD-DID can be formulated as follows:

                For simplicity, the following equations are for men only.

                First stage (see the next post)


                Second stage (see the next post)


                Y_it is the outcome of interest for individual i in month t. Age is age in months and Threshold is 60. Official_it equals one if Age >= Threshold and zero otherwise. Treatment_t indicates the periods (2015 onwards) in which copayments were reduced for individuals above the given age threshold (i.e., =1 if year is 2015, 2017 or 2020, and 0 otherwise). The coefficient of interest is δ, which captures the additional jump induced by the change in copayments during treatment periods.

                However, I am still not sure how to define the control group. Jared provided good advice on the treatment group:
                people who were both eligible for the treatment AND received the treatment
                That means the treatment group must meet two conditions simultaneously: i) they are aged 60 or above in 2010 or in later years; and ii) they are already retired in 2010 or in later years. Is my understanding correct?
                For the control group:
                who were ineligible for the treatment but were otherwise the same on background covariates
                Does this mean that the control group includes individuals who were younger than 60 in 2010 or in later years? Are there any additional conditions for the control group that I have missed?

                I would greatly appreciate it if you both could take a look at my econometric equations as well as the treatment-control assignment.

                Thank you.
                Last edited by Matthew Williams; 15 Sep 2022, 08:59.


                • #9
                  First stage
                  [Image: Screen Shot 2022-09-15 at 23.48.00.png]


                  Second stage
                  [Image: Screen Shot 2022-09-15 at 23.48.12.png]


                  Y_it is the outcome of interest for individual i in month t. Age is age in months and Threshold is 60. Official_it equals one if Age >= Threshold and zero otherwise. Treatment_t indicates the periods (2015 onwards) in which copayments were reduced for individuals above the given age threshold (i.e., =1 if year is 2015, 2017 or 2020, and 0 otherwise). The coefficient of interest is δ, which captures the additional jump induced by the change in copayments during treatment periods.
                  Last edited by Matthew Williams; 15 Sep 2022, 09:00.


                  • #10
                    As I see it, in this framework, the treatment group is those people who were eligible to retire and actually did retire in 2015. The control group is everybody else.

                    As for your equations, an obvious problem is that Age is age in months, and Threshold = 60 implies that people are eligible to retire at age 5 years. Clearly you did not mean that.

                    Other than that, they look reasonable to me. But caveat emptor: I am not an econometrician, and as I have previously indicated, I have only modest knowledge of the techniques being proposed here, so do not feel very reassured by my concurrence.


                    • #11
                      Dear Prof. Clyde,

                      Thank you so much for your response. Yes, you are right, the threshold should be expressed in months. In #9, I mistakenly expressed it in years. Thank you for your constant support. Much appreciated.
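
                      As a minimal sketch of the corrected indicator, with the threshold converted to months. The variable names age_m and official and the four ages listed are made up for illustration:

                      ```stata
                      * Hypothetical names: age_m = age in months, official = eligibility indicator
                      clear
                      input age_m
                      708
                      719
                      720
                      745
                      end

                      scalar threshold_m = 60*12     // 60 years expressed in months = 720

                      * Guard against missing ages, which Stata treats as larger than any number
                      gen byte official = age_m >= threshold_m if !missing(age_m)
                      list age_m official
                      ```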


                      • #12
                        Dear Prof. Clyde Schechter and Jared Greathouse,

                        I am sorry to mention you again; I have now finished cleaning the data and generated the key variables. Although I largely understand the econometric models in #9, I am not sure whether I generated the key variables correctly. I would therefore greatly appreciate it if you could take a look at my code below. Thank you.
                        Code:
                            * Create the treatment period
                            gen T = 0
                                replace T = 1 if inlist(year, 2015, 2017, 2020)
                            
                            *** Create the threshold for men
                            /* Since the policy took effect in July 2014 and the official retirement age
                               for men is 60, people who were born in January 1954 would turn 60 in 2014
                               and thus be eligible for the policy. Similarly, for women the corresponding month
                               and year for them to turn 55 is January 1959
                            */
                            
                            * Create a variable named jan54 for men
                            gen jan54 = ym(1954, 1)
                            
                            * Create a variable named jan59 for women
                            gen jan59 = ym(1959, 1)
                            
                            * Generate a running variable
                            * ym is age in months of individuals
                        
                            gen R = ym - jan54 if female==0   // this is for men
                                replace R = ym - jan59 if female==1  // this is for women
                            
                            /* Now create a treatment group. That is, people who turned 55 (women) or 60 (men)
                               in July 2014 and did retire in 2015 or in 2017 or in 2020.
                               retire is an indicator of retirement status (1 = retired)
                            */
                            
                            gen D = 0
                                replace D = 1 if R>=0 & retire==1  
                                
                            /* R>=0 means that R includes men aged 60 or older in 7/2014 and women aged 55
                              or older in 7/2014. This makes sure that I use only eligible people.
                            */
                                    
                            /* Generate an interaction term between treatment period and treatment group.
                               The coef of this interaction is δ as described in the second stage in #9
                            */
                            gen T_D = T*D
                        Interestingly, I find that the distributions of D and T_D are exactly the same, so I am not sure whether I did something wrong.
                        Code:
                        . tab D
                        
                                  D |      Freq.     Percent        Cum.
                        ------------+-----------------------------------
                                  0 |    218,903       99.44       99.44
                                  1 |      1,237        0.56      100.00
                        ------------+-----------------------------------
                              Total |    220,140      100.00
                        
                        . tab T_D
                        
                                T_D |      Freq.     Percent        Cum.
                        ------------+-----------------------------------
                                  0 |    218,903       99.44       99.44
                                  1 |      1,237        0.56      100.00
                        ------------+-----------------------------------
                              Total |    220,140      100.00
                        Last edited by Matthew Williams; 19 Sep 2022, 11:02.


                        • #13
                          I don't/haven't worked with DIFF-in-DISC, but the poster below me seems to have found a command which will prevent you from having to do as much manual labor.


                          • #14
                            Re #12.

                            The code looks basically correct. It could be streamlined a bit, but let's not bother with that at this point.

                            So probably there is something wrong with your data. The two tabulations you show suggest, but do not prove, that D == T_D. So the first thing is to check that out.
                            Since T_D = T*D, we deduce D == T*D means that in every observation we must have T == 1 or D == 0 (or both). So run -tab T D- to see if that is true. The absence of the combination T == 0 & D == 1 would suggest that your dataset is lacking in information about the retirees prior to the policy being in place.
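
                            A minimal sketch of that check, using a made-up five-observation dataset in place of the real one; T, D and T_D follow the definitions in #12:

                            ```stata
                            * Toy data standing in for the real dataset (an assumption for illustration)
                            clear
                            input T D
                            0 0
                            0 0
                            1 0
                            1 1
                            1 1
                            end
                            gen T_D = T*D

                            * Cross-tabulate period by group; look for an empty T == 0 & D == 1 cell
                            tab T D, missing

                            * Number of treated-group observations before the policy
                            count if T == 0 & D == 1

                            * Number of observations where D and T_D disagree
                            count if D != T_D
                            ```

                            If both counts are zero in the real data, D and T_D are identical because no treated-group observations exist in the pre-policy years.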

                            Without seeing example data, that is the best I can offer you.


                            • #15
                              Dear Prof. Clyde Schechter and Jared Greathouse,

                              Thank you for your advice and suggestions. I have found why the distributions of D and T_D were exactly the same. It is because of the way I created jan54 and jan59. Specifically, I used the function ym() to create these two variables, so they take negative values under Stata's rules on dates. As a result, the following code did not give me the results I wanted, though the syntax was correct.
                              Code:
                                  gen R = ym - jan54 if female==0   // this is for men
                                      replace R = ym - jan59 if female==1  // this is for women
                              The problem was solved when I used age in years, or another way of creating age in months (without using the ym() function).
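
                              For reference, Stata counts monthly dates from January 1960 (which is 0), so ym() returns negative values for earlier months, and subtracting such a date from an age in months does not measure distance from the retirement age. The sketch below shows the values and one possible way to build the running variable by differencing two monthly dates. The variables byear and bmonth (year and month of birth) and the toy data are assumptions, and this is not necessarily the fix used above:

                              ```stata
                              * Monthly dates are months since 1960m1
                              display ym(1954, 1)     // -72
                              display ym(1959, 1)     // -12
                              display ym(2014, 7)     // 654

                              * Toy data: byear and bmonth are hypothetical birth-date variables
                              clear
                              input female byear bmonth
                              0 1954 1
                              0 1956 3
                              1 1959 1
                              1 1961 9
                              end

                              gen birth_ym = ym(byear, bmonth)
                              gen age_m = ym(2014, 7) - birth_ym       // age in months in July 2014

                              * Months past the official retirement age (60 for men, 55 for women)
                              gen R = age_m - cond(female == 1, 55*12, 60*12)
                              list female byear bmonth age_m R
                              ```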
