  • Question about Difference in Difference

    Hello,

    I have a question about how the number of observations affects the results of difference in differences. Do the control and treatment groups need to be of similar size? In my data I have many more observations in the control group than in the treatment group, and I am getting imprecise results. Could this be part of the problem?

    Thanks,
    Neil

  • #2
    What matters is the actual number of entities in each group; whether they are equal or not is of no real importance. If the number of treatment entities is small, the results will be imprecise. (If the number of control entities is small, the results will also be imprecise.)

    It is true that, if you are able to choose the number of entities in each group, you get the most precision out of the same total number of treatment + control entities if they are equally allocated between the two conditions. But DID is usually used in observational studies and the investigator cannot choose the sample sizes.

    But the real principle that is important is this: the major determinant of the precision of the results is the size of whichever group is smaller. It is the actual size of the smallest group that matters most, not whether or not the two groups are equal.

    Don't forget that there are other determinants of precision besides sample size: what about the amount of extraneous variation in your outcomes? Can you reduce it by including relevant covariates in your model?



    • #3
      Thank you!



      • #4
        Originally posted by Clyde Schechter View Post
        But the real principle that is important is this: the major determinant of the precision of the results is the size of whichever group is smaller. It is the actual size of the smallest group that matters most, not whether or not the two groups are equal.
        Hi, I have the same question as Neil, and I am not sure I have understood correctly. If I have a treatment group of 35 individuals and a control group that could include up to 600 individuals, what should I do in practice?
        Should I use a control group of 35, or take more? I understand that the problem comes from the small treatment group, but unfortunately I cannot change it.

        Thank you in advance!



        • #5
          It doesn't hurt to use a large control group. My point is that it also doesn't help all that much. The statistical power of your design depends on the sizes of the two groups in proportion to 1/sqrt(1/n_treatment + 1/n_control). Notice that even if n_control were infinite, this factor would still be 1/sqrt(1/n_treatment): you cannot do any better than that no matter what.

          In your case, if you had an infinitely large control group, this factor would be 1/sqrt(1/35) = 5.92. Let's see what it is if you choose 1 control per case: 1/sqrt(1/35 + 1/35) = 4.18, so there is considerable room for improvement. With 2 controls per case, it's 1/sqrt(1/35 + 1/70) = 4.83, a bit better. At 10 controls per case, we get 5.64, a substantial improvement, but notice that it's not much different from the upper bound of 5.92. Suppose you use all 600 available controls. You get 5.75, which is scarcely better than 10 controls per case, and scarcely worse than an infinite number of controls. You see the diminishing returns from more controls?
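          The figures above are easy to reproduce. Here is a minimal sketch (in Python rather than Stata, purely for illustration; the arithmetic is the same):

```python
import math

def precision_factor(n_treatment, n_control):
    """Precision factor 1/sqrt(1/n_t + 1/n_c) from the formula above."""
    return 1.0 / math.sqrt(1.0 / n_treatment + 1.0 / n_control)

# 35 cases with various numbers of controls, up to an infinite control group
for n_control in (35, 70, 350, 600, math.inf):
    print(n_control, round(precision_factor(35, n_control), 2))
# 35 -> 4.18, 70 -> 4.83, 350 -> 5.64, 600 -> 5.75, inf -> 5.92
```

          Note how quickly the factor approaches its ceiling of 5.92 once you pass roughly 10 controls per case.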

          Now, given that you have a total of 635 entities available, if it were possible (and I understand that, in your context, it is not) to divide them almost evenly between cases and controls, this factor would be 1/sqrt(1/317 + 1/318) = 12.6. That's a huge improvement over anything that can be done when the number of cases is fixed at 35. The 35 cases is the real bound on power here. If that's all there is, then that's the reality and you live with it. (Mathematically it is not hard to prove that if n_cases + n_controls is some fixed value N, then 1/sqrt(1/n_cases + 1/n_controls) is maximized when n_cases = n_controls = N/2.)

          So you would choose the number of controls by trading off the benefit of more controls, illustrated in the preceding paragraphs, against the cost of obtaining data on more controls. In my work, when we are gathering new data on people, the cost per additional participant is typically quite high, especially in longitudinal studies. But if we're just pulling existing data from an electronic database, the marginal cost of an extra data record is almost zero. If we're gathering data from existing paper records, it's somewhere between those.

          Applying these principles to your situation should give you a sense of how many controls to use given the constraint that you have just 35 cases no matter what, and in light of whatever effort and resources must be expended to acquire control data.



          • #6
            Originally posted by Clyde Schechter View Post
            It doesn't hurt to use a large control group. My point is that it also doesn't help all that much. [...]
            Thank you for your answer, it is clearer now!



            • #7
              Hi, Matteo: You might want to check out the paper (along with Stata code) by Conley, Timothy G., and Christopher R. Taber. 2011. “Inference with 'Differences in Differences' with a Small Number of Policy Changes,” The Review of Economics and Statistics, 93(1), pp. 113-125. (Code to Download). http://economics.uwo.ca/people/faculty/conley.html
              Ho-Chuan (River) Huang
              Stata 17.0, MP(4)



              • #8
                Originally posted by River Huang View Post
                Hi, Matteo: You might want to check out the paper (along with Stata code) by Conley, Timothy G., and Christopher R. Taber (2011). [...]
                Thank you, it looks really interesting!
