Staggered DID with high dimensional data

Havard Rydland

Join Date: Oct 2024

Posts: 2
#1

30 Oct 2024, 05:31

Dear Statalist,

I use Stata 16.1 MP and wish to study how an ownership change at the workplace level affects an outcome at the employee level. My dataset consists of approx. 175,000 employees in 6,000 workplaces. Employees are observed monthly (conditional on their employment) from 2015 through 2019, which gives me a total of approx. 5,000,000 observations. Both employees and workplaces have a unique ID. Ownership changes occur throughout the study period, which gives me 60 time periods and 59 treatment cohorts. I have covariates at different levels (e.g. gender, marital status, and workplace size). A staggered diff-in-diff design and the commands csdid or jwdid (both from SSC) seem most appropriate (correct me if I'm wrong!).

Due to privacy reasons, I can't share a dataex of the original data with you, but here is a toy example to get an idea of the structure:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(eid time wid treat owner cohort y x1 x2 x3)
1 1 1 1 1 4 9.906464 .8840705 1 19
1 2 1 1 1 4 4.7443843 .1828533 1 20
1 3 1 1 1 4 2.8065584 .1280229 1 20
1 4 1 1 2 4 3.2066376 .6478783 1 20
1 5 1 1 2 4 .3065564 .8508041 1 20
2 1 1 1 1 4 4.385781 .6063875 0 19
2 2 1 1 1 4 3.988955 .322536 0 20
2 3 1 1 1 4 .23852494 .8962572 0 20
2 4 1 1 2 4 9.442598 .7763369 0 20
2 5 1 1 2 4 1.143186 .6614004 0 20
3 1 2 1 1 2 2.96908 .8210933 1 15
3 2 2 1 2 2 6.833038 .06343113 1 15
3 3 2 1 2 2 5.724059 .7743081 1 15
3 4 2 1 2 2 1.9303285 .4983811 1 17
3 5 2 1 2 2 7.387565 .6873622 1 17
4 1 3 0 1 0 4.0051527 .4709546 0 25
4 2 3 0 1 0 1.1618755 .14672393 0 25
4 3 3 0 1 0 8.574128 .7951593 0 22
4 4 3 0 1 0 6.941094 .7541171 0 22
4 5 3 0 1 0 4.971269 .8676175 0 22
5 1 3 0 1 0 5.524818 .62491 0 25
5 2 3 0 1 0 1.9233507 .10263631 0 25
6 1 3 0 1 0 5.307955 .6798756 1 25
6 2 3 0 1 0 2.655255 .8680485 1 25
6 3 3 0 1 0 2.7890375 .8407795 1 22
7 3 2 1 2 2 9.030829 .4029677 1 15
7 4 2 1 2 2 3.479616 .6724598 1 17
7 5 3 0 1 0 1.93837 .7535322 1 22
end
label values owner owner
label def owner 1 "public", modify
label def owner 2 "private", modify

Since observations are nested in employees, who are in turn nested in workplaces, I get an error message (r(451); repeated time values within panel) when trying to estimate with csdid using the workplace ID as the panel identifier.

Code:

csdid y x1 x2 x3, ivar(wid) time(time) gvar(cohort)
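A minimal sketch of why the call above trips r(451), using four made-up rows rather than the real data: two employees in the same workplace share every wid-time pair, so wid and time do not uniquely identify observations, whereas eid and time do. Only base Stata commands are used.

```stata
* Made-up illustration: two employees (eid 1, 2) in one workplace (wid 1)
clear
input float(eid time wid)
1 1 1
1 2 1
2 1 1
2 2 1
end

* wid-time pairs repeat because several employees share a workplace
duplicates report wid time

* capture keeps the do-file running; _rc is nonzero when the check fails
capture isid wid time
display "wid-time unique?  rc = " _rc

* employee-month is the level at which rows are unique
capture isid eid time
display "eid-time unique?  rc = " _rc
```

Running the same two isid checks on the full data is a quick way to confirm which ID and time combination can serve as the panel setup.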

Will the repeated cross-section estimator (i.e. csdid without ivar()) still apply workplace FE? (Not applicable to the example data.)

Code:

csdid y x1 x2 x3, cluster(wid) time(time) gvar(cohort)

When I apply the CS option on my real data, estimation is very slow. Is jwdid a better estimator considering my "messy" panel structure? Or will I run into the same problem with regard to estimation time? Aggregation to a higher time or ID unit is of course an option, but I also wish to keep as much variation as possible.
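To make the aggregation option concrete, here is a rough sketch of coarsening months to quarters, which would cut 60 monthly periods to 20. It is an illustration under assumptions, not a recommendation: the variable names qtime and qcohort are hypothetical, the input rows are rounded from the toy example, and cohort == 0 is assumed to mark never-treated units. A cohort whose change falls mid-quarter would have a partially treated first quarter, which is part of the variation lost.

```stata
* Hypothetical sketch: coarsen monthly periods and cohorts to quarters
clear
input float(eid time wid cohort y)
1 1 1 4 9.9
1 2 1 4 4.7
1 3 1 4 2.8
1 4 1 4 3.2
1 5 1 4 0.3
4 1 3 0 4.0
4 2 3 0 1.2
4 3 3 0 8.6
4 4 3 0 6.9
4 5 3 0 5.0
end

* months 1-3 become quarter 1, months 4-6 become quarter 2, and so on
gen qtime   = ceil(time/3)
* keep 0 for never-treated units, otherwise the quarter of first treatment
gen qcohort = cond(cohort > 0, ceil(cohort/3), 0)

* one row per employee-quarter; y becomes the quarterly mean
collapse (mean) y, by(eid wid qtime qcohort)

* fails if an employee appears in two workplaces within one quarter
isid eid qtime
list, sepby(eid)
```

The isid line doubles as a check for employees who switch workplace or cohort within a quarter, since those would produce more than one row per employee-quarter.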

Hope this was clear. All suggestions are appreciated.
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#2

30 Oct 2024, 10:31

This wasn't very clear, but repeated time values simply means you've misspecified your panel setup. You need to use the lowest level of aggregation and time (employee level).
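As a hedged sketch of what an employee-level setup would need to satisfy, the checks below use six rows taken from the toy example (eid 3 and eid 7). The flag names cohort_varies and wid_varies are hypothetical. The assumption being checked, as far as I understand csdid, is that the gvar() variable should be constant within the ivar() unit, and that a cluster variable should nest the panel units.

```stata
* Rows for eid 3 and eid 7 from the toy example (eid time wid cohort only)
clear
input float(eid time wid cohort)
3 3 2 2
3 4 2 2
3 5 2 2
7 3 2 2
7 4 2 2
7 5 3 0
end

* employee-month must uniquely identify rows before eid can serve as ivar()
isid eid time

* sorting by cohort within eid puts min first and max last;
* they differ only if the cohort changes for that employee
bysort eid (cohort): gen byte cohort_varies = cohort[1] != cohort[_N]

* same idea for workplace: flags employees seen in more than one wid
bysort eid (wid): gen byte wid_varies = wid[1] != wid[_N]

sort eid time
list eid time wid cohort cohort_varies wid_varies, sepby(eid)
```

In the toy data, employee 7 moves from workplace 2 to workplace 3 and changes cohort with it, so such movers would need a decision (for example, fixing the cohort at first treatment) before something like csdid y x1 x2 x3, ivar(eid) time(time) gvar(cohort) cluster(wid) is attempted.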
Havard Rydland

Join Date: Oct 2024

Posts: 2
#3

31 Oct 2024, 02:54

Apologies for being unclear!

I do not want employee fixed effects, which I assume follows from choosing employee ID as the panel identifier in csdid (correct me if I'm wrong). So unless I do some sort of aggregation, I guess panel data estimators are out of the question. But the issue with slow estimation due to the high number of time periods and cohorts still remains.