  • Abadie, Athey, Imbens, Wooldridge level of clustering

    Suppose I have data on all companies in an entire country (say, 1M companies), so the data cover nearly the entire population.

    I want to run a DID that studies the impact of a shock that happened in 2000 on companies' log sales.

    Each company belongs to one of 100 industries.

    There is a measure of exposure to the shock at the industry level; call it "industry_exposure". This means that all companies in the same industry share the same level of "industry_exposure".

    For those 1M companies, I run this

    Code:
    reghdfe   lsales    industry_exposure    post   industry_exposureXpost , absorb(id   year)    cluster(???)
    where post equals 1 if year>=2000, industry_exposureXpost is the DID term of interest, and "id" is the company id.

    I have company FE and year FE.
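
    As a sketch, the specification described above can be written as the equation below. The symbols are assumptions introduced here for illustration: \alpha_i is the company fixed effect, \lambda_t the year fixed effect, j(i) the industry of company i, and \beta the DID coefficient on industry_exposureXpost. If each company stays in one industry over the sample (an assumption), the main effect of industry_exposure is absorbed by \alpha_i, and post is absorbed by \lambda_t.

    ```latex
    \[
    \text{lsales}_{it}
      = \alpha_i + \lambda_t
      + \beta \,\bigl(\text{industry\_exposure}_{j(i)} \times \text{post}_t\bigr)
      + \varepsilon_{it},
    \qquad
    \text{post}_t = \mathbf{1}\{t \ge 2000\}
    \]
    ```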

    At which level should I cluster?



    I read Prof Wooldridge's answer here @Jeff Wooldridge
    Appropriate Dimension for Clustering of Standard Errors - Statalist

    So there are two questions to ask.
    1. Have the data been obtained from cluster sampling?
    2. What is the level of assignment of the key explanatory variables?

    Regarding #1, the data are not cluster sampled; they cover almost the full set of companies in the country. So question #1 doesn't justify any clustering (right?).

    Regarding #2, does the answer depend on whether I take a sample or use essentially the entire population? I use (almost) the entire population of companies.
    Last edited by William Park; 14 Jan 2024, 06:29.

  • #2
    Also, the other extended regressions I want to run are triple differences. There is a place-based policy in 2000, "placepolicy2000", which differs by state.

    I want to study whether industry_exposure has different effects in states with different placepolicy2000. So I run this:

    Code:
    reghdfe   lsales    industry_exposure    post   industry_exposureXplacepolicy2000Xpost     placepolicy2000Xpost    industry_exposureXplacepolicy2000       industry_exposureXpost, absorb(id   year)    cluster(???)
    Similarly, I also want to know whether the impact depends on the company's size in 2000, "size2000". So I run this:

    Code:
    reghdfe   lsales    industry_exposure    post   industry_exposureXsize2000Xpost     size2000Xpost    industry_exposureXsize2000       industry_exposureXpost, absorb(id   year)    cluster(???)

    At which level should I cluster in each case?
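
    A minimal sketch of how the first triple-difference regression above could be set up with factor-variable operators instead of hand-built interaction variables. Stata limits variable names to 32 characters, so a name such as industry_exposureXplacepolicy2000Xpost would need shortening if it were generated by hand. The sketch assumes a firm-year panel in memory, that placepolicy2000 is numeric, that post does not already exist, and that the community-contributed reghdfe package is installed. The cluster variable name "industry" is hypothetical and only a placeholder, not a recommendation, since the appropriate level is the open question.

    ```stata
    * Sketch under stated assumptions; run as a do-file so the local macro persists.
    * Requires: ssc install reghdfe (plus its dependency ftools).
    gen byte post = (year >= 2000) if !missing(year)

    * Placeholder only: "industry" is a hypothetical variable name,
    * and the right clustering level is the question being asked.
    local clustvar industry

    * Triple difference. Terms that do not vary within company (for example
    * industry_exposure x placepolicy2000) or within year (post) are collinear
    * with the absorbed fixed effects, so they are left out here.
    reghdfe lsales c.industry_exposure#c.placepolicy2000#c.post ///
        c.placepolicy2000#c.post c.industry_exposure#c.post,    ///
        absorb(id year) vce(cluster `clustvar')
    ```

    The size2000 version would follow the same pattern with size2000 in place of placepolicy2000.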



    • #3
      In Abadie, Athey, Imbens, and Wooldridge, Section "VII. Implications for Practice", the authors discuss 4 practical cases.

      1. "First, we discuss the case where there is no cluster sampling."

      2. "Next consider the case of clustered assignment and where we either have random sampling or observe the entire population."

      3. "Another reason to cluster standard errors is cluster sampling."

      4. "Consider a setting with unit-level panel data on outcomes and a treatment that is implemented on the same period for all units in the treatment group. In this case, the difference-in-differences estimator is equal to the coefficient on the treatment variable in a regression of the change in average outcomes between the post-treatment and the pretreatment periods on a constant and a treatment indicator."
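
      The regression described in the 4th quote can be sketched as below. The notation is an assumption introduced here, not taken from the paper: \bar{Y}_{i,\text{post}} and \bar{Y}_{i,\text{pre}} are unit i's average outcomes after and before treatment, W_i is the treatment indicator, and \tau is the difference-in-differences coefficient. The quote uses a treatment indicator, whereas industry_exposure in this thread is described as a level of exposure shared within an industry.

      ```latex
      \[
      \Delta \bar{Y}_i = \bar{Y}_{i,\text{post}} - \bar{Y}_{i,\text{pre}},
      \qquad
      \Delta \bar{Y}_i = \alpha + \tau\, W_i + \eta_i
      \]
      ```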


      I am confused about which case my example belongs to.

      Is it the 1st one, because there is no cluster sampling?

      Is it the 2nd one, because my "sample" is nearly the entire population of companies? But does my example have "clustered assignment"?

      Is it the 4th one, because my example is a DID?

      I guess it's not the 3rd one, because there is no cluster sampling.
