Abadie, Athey, Imbens, Wooldridge level of clustering

William Park

Join Date: Jan 2024

Posts: 5
#1

14 Jan 2024, 05:20

Suppose I have data on all companies in an entire country (say, 1M companies), so the data are very nearly the full population.

I want to run a DID that studies the impact of a shock that occurred in 2000 on companies' log sales.

Each company belongs to one of 100 industries.

There is a measure of exposure to the shock at the industry level; call it "industry_exposure". That is, all companies in the same industry share the same value of "industry_exposure".

For those 1M companies, I run this:

Code:

reghdfe lsales industry_exposure post industry_exposureXpost , absorb(id year) cluster(???)

where post equals 1 if year>=2000, industry_exposureXpost is the DID term of interest, and "id" is the company id.

I have company FE and year FE.
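
To make the setup concrete, here is a minimal sketch of the specification on simulated data. Everything in it is an assumption for illustration: the dataset is made up, "industry" is a hypothetical industry identifier, and the two vce(cluster ...) choices are only the two candidate levels I can think of (company and industry), not a claim about which one is correct. reghdfe is user-written, so it has to be installed first (ssc install reghdfe, which also needs ftools).

```stata
* Sketch on simulated data; all names and numbers are illustrative assumptions.
clear
set seed 12345
set obs 200                               // 200 hypothetical companies
gen id = _n
gen industry = mod(id, 20) + 1            // 20 hypothetical industries
gen industry_exposure = industry / 20     // constant within industry by construction

expand 10                                 // 10 yearly observations per company
bysort id: gen year = 1994 + _n           // years 1995-2004
gen byte post = (year >= 2000)
gen industry_exposureXpost = industry_exposure * post

* Made-up outcome with a DID effect of 0.5 plus noise
gen lsales = 0.5 * industry_exposureXpost + rnormal()

* Candidate 1: cluster at the company level
reghdfe lsales industry_exposureXpost, absorb(id year) vce(cluster id)

* Candidate 2: cluster at the industry level (the level at which exposure varies)
reghdfe lsales industry_exposureXpost, absorb(id year) vce(cluster industry)
```

The point estimate is the same in both runs; only the standard errors differ, which is why the choice matters here.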

At which level should I cluster?

I read Prof Wooldridge's answer here @Jeff Wooldridge
Appropriate Dimension for Clustering of Standard Errors - Statalist

So there are two questions to ask.
1. Have the data been obtained from cluster sampling?
2. What is the level of assignment of the key explanatory variables?

Regarding #1, the data are not cluster sampled. They are almost the full set of companies in the country. So question #1 doesn't justify any clustering (right?).

Regarding #2, does the answer depend on whether I take a sample or use essentially the entire population? I use the (almost) entire population of companies.

Last edited by William Park; 14 Jan 2024, 05:29.
Tags: cluster, cluster robust se
William Park

Join Date: Jan 2024

Posts: 5
#2

14 Jan 2024, 05:21

Also, the other extended regressions I want to run are triple differences. There is a place-based policy in 2000, "placepolicy2000", which differs by state.

I want to study whether industry_exposure has different effects in states with different values of placepolicy2000. So I run this:

Code:

reghdfe lsales industry_exposure post industry_exposureXplacepolicy2000Xpost placepolicy2000Xpost industry_exposureXplacepolicy2000 industry_exposureXpost, absorb(id year) cluster(???)

Similarly, I also want to know whether the impact depends on the company's size in 2000, "size2000". So I run this:

Code:

reghdfe lsales industry_exposure post industry_exposureXsize2000Xpost size2000Xpost industry_exposureXsize2000 industry_exposureXpost, absorb(id year) cluster(???)
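
For concreteness, here is a rough sketch of how the triple-difference terms could be entered, again on a made-up dataset. One practical note: Stata variable names are limited to 32 characters, so a name like industry_exposureXplacepolicy2000Xpost is only shorthand; the sketch uses factor-variable notation (c.a#c.b#c.c) instead. The variables "state" and "industry" and all the numbers are hypothetical, and vce(cluster id) is just a placeholder, since the clustering level is exactly what I am asking about.

```stata
* Sketch on simulated data; all names and numbers are illustrative assumptions.
clear
set seed 2024
set obs 40                                 // 40 hypothetical companies
gen id = _n
gen industry = mod(id, 5) + 1              // hypothetical industry id
gen state    = mod(id, 4) + 1              // hypothetical state id
gen industry_exposure = industry / 5       // constant within industry
gen placepolicy2000   = mod(state, 2)      // stand-in policy value, constant within state
gen size2000          = runiform()         // stand-in company size in 2000

expand 6                                   // 6 yearly observations per company
bysort id: gen year = 1996 + _n            // years 1997-2002
gen byte post = (year >= 2000)
gen lsales = 0.3 * industry_exposure * placepolicy2000 * post + rnormal()

* Time-invariant terms (industry_exposure, industry_exposure x placepolicy2000)
* are collinear with the company FE, and post with the year FE, so they are left out here.

* Triple diff by place policy
reghdfe lsales c.industry_exposure#c.post c.placepolicy2000#c.post ///
    c.industry_exposure#c.placepolicy2000#c.post, ///
    absorb(id year) vce(cluster id)

* Triple diff by company size in 2000
reghdfe lsales c.industry_exposure#c.post c.size2000#c.post ///
    c.industry_exposure#c.size2000#c.post, ///
    absorb(id year) vce(cluster id)
```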

At which level should I cluster in each case?
William Park

Join Date: Jan 2024

Posts: 5
#3

14 Jan 2024, 09:43

In Abadie, Athey, Imbens, Wooldridge, Section "VII. Implications for Practice" discusses four practical cases.

1. "First, we discuss the case where there is no cluster sampling."

2. "Next consider the case of clustered assignment and where we either have random sampling or observe the entire population."

3. "Another reason to cluster standard errors is cluster sampling."

4. "Consider a setting with unit-level panel data on outcomes and a treatment that is implemented on the same period for all units in the treatment group. In this case, the difference-in-differences estimator is equal to the coefficient on the treatment variable in a regression of the change in average outcomes between the post-treatment and the pretreatment periods on a constant and a treatment indicator."

I am confused about which case my example belongs to.

Is it the 1st one, because there is no cluster sampling?

Is it the 2nd one, because my "sample" is nearly the entire population of companies? But does my example have "clustered assignment"?

Is it the 4th one, because my example is a DID?

I guess it's not the 3rd one, because there is no cluster sampling.