  • Few treated, many controls

    Dear All,

    Thank you for your attention.

    I am conducting a TWFE DID analysis with panel data. My issue is that the treated group is much smaller than the control group. In 8 years of yearly data, only a dozen or even fewer observations are treated, at different years, while thousands are untreated.
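
    For concreteness, a minimal sketch of the kind of TWFE specification I mean (the panel and variable names are just placeholders, not my actual data):

    Code:
    * two-way fixed effects DID: unit and year fixed effects, clustered SEs
    * treated_post = 1 for treated units in their post-treatment years
    xtset unit_id year
    xtreg y i.treated_post i.year, fe vce(cluster unit_id)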

    As a rule of thumb, a regression needs at least 20 observations to make statistical sense, and the treated observations are certainly too few to support a regression on their own. But supplemented by the thousands of controls, does the DID estimation make statistical sense (e.g., by appeal to the law of large numbers)?

    Thanks!

  • #2
    The problem is that with only a dozen treated observations, your standard errors are going to be very wide, notwithstanding the thousands of controls. Your analysis is almost certain to be underpowered. You can run it if you like, but you will never know whether the non-significant results you are likely to obtain are due to the absence of an effect or to insufficient data on the treatment condition.



    • #3
      Originally posted by Clyde Schechter View Post
      The problem is that with only a dozen treated observations, your standard errors are going to be very wide, notwithstanding the thousands of controls. Your analysis is almost certain to be underpowered. You can run it if you like, but you will never know whether the non-significant results you are likely to obtain are due to the absence of an effect or to insufficient data on the treatment condition.
      Thank you for the prompt reply!

      The results of the DID estimation are significant at the 5% level, though. Does that mean the lack of power is not a problem in my data?



      • #4
        Well, it makes it a different kind of problem. Power is a pre-analysis construct and is calculated without reference to the results. Whether the results are statistically significant or not does not change the power of the analysis. The problem with low-powered analyses is that, compared to well-powered analyses, statistically significant results are more likely to strongly overestimate the true effect, and even have an increased chance of having the wrong sign! I don't have time to explain it here, but the attached article does it well: gelman-carlin-2014-beyond-power-calculations-assessing-type-s-(sign)-and-type-m-(magnitude)-errors.pdf.
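
        If you want to see the phenomenon for yourself, here is a stylized simulation sketch (the true effect, sample sizes, and program name are just assumptions, not your model) showing that with only a dozen treated observations, the estimates that turn out statistically significant tend to greatly exaggerate the true effect:

        Code:
        * stylized simulation: true effect = 0.2, 12 treated vs 1000 controls
        clear all
        set seed 12345
        capture program drop simfew
        program define simfew, rclass
            clear
            set obs 1012
            gen treat = _n <= 12
            gen y = 0.2*treat + rnormal()
            regress y treat
            return scalar b   = _b[treat]
            return scalar sig = abs(_b[treat]/_se[treat]) > invttail(e(df_r), 0.025)
        end
        simulate b=r(b) sig=r(sig), reps(2000) nodots: simfew
        summarize b            // estimates center on the true 0.2 overall
        summarize b if sig     // but the "significant" ones are far larger on average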



        • #5
          Originally posted by Clyde Schechter View Post
          Well, it makes it a different kind of problem. Power is a pre-analysis construct and is calculated without reference to the results. Whether the results are statistically significant or not does not change the power of the analysis. The problem with low-powered analyses is that, compared to well-powered analyses, statistically significant results are more likely to strongly overestimate the true effect, and even have an increased chance of having the wrong sign! I don't have time to explain it here, but the attached article does it well: gelman-carlin-2014-beyond-power-calculations-assessing-type-s-(sign)-and-type-m-(magnitude)-errors.pdf.
          Thank you for the explanation.

          One more question, please: if I have twenty treated units and randomly select only 100 controls, would that be better?



          • #6
            You could use some type of matching on the X's to create a 1:n match; kmatch has an nn(#) option.
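
            For example, something along these lines, if I recall the kmatch syntax correctly (kmatch is community-contributed, installed via ssc install kmatch; the variable names here are placeholders):

            Code:
            * 1:4 nearest-neighbor matching on the propensity score, estimating the ATET
            ssc install kmatch
            kmatch ps treat x1 x2 x3 (y), att nn(4)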

            The treatment occurring in different years opens a can of worms, though.



            • #7
              Originally posted by George Ford View Post
              You could use some type of matching on the X's to create a 1:n match; kmatch has an nn(#) option.

              The treatment occurring in different years opens a can of worms, though.
              Thank you!



              • #8
                I'm not sure that creating matched pairs solves the problem. I think the fundamental difficulty this data set faces is the small number in the treatment group. The large number of controls poses no difficulty as far as I can see.

                I suppose one might be concerned that, with only 12 treated cases, there is a fair chance that this group, even if accrued by random sampling from a population, might fail to be representative of the target population about which inference is to be made. Matching controls will mitigate confounding that might result from that, but the matched pairs will still represent an unrepresentative sample of the population.



                • #9
                  Originally posted by Clyde Schechter View Post
                  I'm not sure that creating matched pairs solves the problem. I think the fundamental difficulty this data set faces is the small number in the treatment group. The large number of controls poses no difficulty as far as I can see.

                  I suppose one might be concerned that, with only 12 treated cases, there is a fair chance that this group, even if accrued by random sampling from a population, might fail to be representative of the target population about which inference is to be made. Matching controls will mitigate confounding that might result from that, but the matched pairs will still represent an unrepresentative sample of the population.
                  I see. Is there a minimum number of treated cases, such as 20 or 30?



                  • #10
                    There is no minimum, save the desire to have a decent sample size.

                    I think Clyde's concern regards whether your estimated effect is generalizable to the population (ATE) or restricted just to the treated observations (ATET). Why are so few treated? Is it just some freaky thing in the data where you have thousands of observations, all of which could be treated, but only 12 treated units end up in the sample? That is very strange. I wouldn't believe the results no matter what you did. Or, is your control group too broad?

                    Why are so few treated?





                    • #11
                      Originally posted by Clyde Schechter View Post
                      Well, it makes it a different kind of problem. Power is a pre-analysis construct and is calculated without reference to the results. Whether the results are statistically significant or not does not change the power of the analysis. The problem with low-powered analyses is that, compared to well-powered analyses, statistically significant results are more likely to strongly overestimate the true effect, and even have an increased chance of having the wrong sign! I don't have time to explain it here, but the attached article does it well: gelman-carlin-2014-beyond-power-calculations-assessing-type-s-(sign)-and-type-m-(magnitude)-errors.pdf.
                      See also:

                      Button, K., Ioannidis, J., Mokrysz, C. et al. Power failure: why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci 14, 365–376 (2013). https://www.nature.com/articles/nrn3475
                      Key Points

                      • Low statistical power undermines the purpose of scientific research; it reduces the chance of detecting a true effect.
                      • Perhaps less intuitively, low power also reduces the likelihood that a statistically significant result reflects a true effect.
                      • Empirically, we estimate the median statistical power of studies in the neurosciences is between ∼8% and ∼31%.
                      • We discuss the consequences of such low statistical power, which include overestimates of effect size and low reproducibility of results.
                      • There are ethical dimensions to the problem of low power; unreliable research is inefficient and wasteful.
                      • Improving reproducibility in neuroscience is a key priority and requires attention to well-established, but often ignored, methodological principles.
                      • We discuss how problems associated with low power can be addressed by adopting current best-practice and make clear recommendations for how to achieve this.

                      --
                      Bruce Weaver
                      Email: [email protected]
                      Version: Stata/MP 18.5 (Windows)



                      • #12
                        There is no fixed cut-off number. Smaller is worse; larger is better.

                        Here's how I would think about this. Why are there only 12 treated cases in your data? Perhaps that is all there are in the world. One might then wonder why study this particular treatment at all, but perhaps there is a justification for that because it is groundbreaking in some important sense. More likely, there are more of them out there, but getting their data may not be feasible. In that case, I would write up my results, including an explanation of why the treated sample is so small, and being appropriately modest about any conclusions drawn. Probably you won't be able to publish in a top-tier journal with that, but you can place it somewhere. Or perhaps you could, in fact, with reasonable effort, get data on more cases. If so, I would do that and then revise the work to get more persuasive results. I would probably aim for the largest feasible number of cases under whatever constraints you live with. (I'm assuming here that the largest feasible number is not going to be all that much bigger than the 12 you already have, at least not so much bigger that you run into the opposite problem, where every difference, even ones way too small for anybody to care about, becomes statistically significant as a result of gargantuan sample size.)



                        • #13
                          Originally posted by Clyde Schechter View Post
                          There is no fixed cut-off number. Smaller is worse; larger is better.

                          Here's how I would think about this. Why are there only 12 treated cases in your data? Perhaps that is all there are in the world. One might then wonder why study this particular treatment at all, but perhaps there is a justification for that because it is groundbreaking in some important sense. More likely, there are more of them out there, but getting their data may not be feasible. In that case, I would write up my results, including an explanation of why the treated sample is so small, and being appropriately modest about any conclusions drawn. Probably you won't be able to publish in a top-tier journal with that, but you can place it somewhere. Or perhaps you could, in fact, with reasonable effort, get data on more cases. If so, I would do that and then revise the work to get more persuasive results. I would probably aim for the largest feasible number of cases under whatever constraints you live with. (I'm assuming here that the largest feasible number is not going to be all that much bigger than the 12 you already have, at least not so much bigger that you run into the opposite problem, where every difference, even ones way too small for anybody to care about, becomes statistically significant as a result of gargantuan sample size.)
                          Much appreciated.

                          Actually, there were 30 treated at the beginning, but many lacked common support in a classical DID with IPW-weighted matching. Thus, it seems improper to include them in a TWFE either....



                          • #14
                            The exclusion of observations outside of common support is one of the serious drawbacks of propensity score matching. That's why I rarely use it any more. I find propensity weighting preferable. You pweight each treated case by 1/(propensity score) and each control by 1/(1 - propensity score). Then you run the regression on the entire sample. No data loss.

                            Now, sometimes doing that you find some observations with extremely low propensity scores (or controls with propensity scores very close to 1) which leads to these observations having huge weights. That can be anxiety provoking. Some people recommend putting a cap on these weights at 10 or 15. In my own experience, I have found that whether you cap the weights or just use them as they come, you don't see much difference in the regression results.
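
                            In Stata terms, a minimal sketch of what I mean, with placeholder variable names (this is not your model):

                            Code:
                            * propensity weighting: treated get 1/ps, controls get 1/(1-ps)
                            logit treat x1 x2 x3
                            predict ps, pr
                            gen ipw = cond(treat, 1/ps, 1/(1-ps))
                            replace ipw = min(ipw, 10)    // optional cap on extreme weights
                            regress y i.treat x1 x2 x3 [pweight=ipw]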

                            But before you run off and try that, I want to deliver another note of caution. If you lost 18 out of 30 cases due to lack of common support, that says your treated cases are really very different from your controls on the variables that went into your propensity score estimation. I don't see group differences that extreme very often. To me, it's a red flag. If the groups are that extremely different, I would be very worried that they are also radically different on things we can't observe (and which are therefore not adjusted for by the propensity matching). Is there something different about the way you sampled treated cases and controls that might have induced this huge difference? Could it be due to data errors? My intuition is that something is wrong, perhaps seriously wrong, here. I would look into this carefully.
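
                            As a first diagnostic, with the ps variable from the sketch above you can look at the overlap directly:

                            Code:
                            * compare the propensity score distributions in the two groups
                            summarize ps if treat, detail
                            summarize ps if !treat, detail
                            twoway (kdensity ps if treat) (kdensity ps if !treat), ///
                                legend(order(1 "Treated" 2 "Controls")) xtitle("Propensity score")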



                            • #15
                              Thank you, all very useful insights.

