Dear Statalist users,
I am writing to ask for your help in choosing the level of fixed effects to control for in a linear panel model, using the test proposed in Papke and Wooldridge (2023), "A simple, robust test for choosing the level of fixed effects in linear panel data models" (link).
(I am studying the effect of a village-level treatment on household-level outcomes, so I am trying to decide whether to include household or village fixed effects.)
I was able to follow the procedure in Stata, using the NLSY data as an example, but I have two questions about the procedure and its final steps.
Suppose I estimate the effect of hours worked on ln(wage) as below, with a vector of individual-level controls that includes time-varying variables (age and weeks worked) and a time-invariant one (race).
lnwage_it = b0 + b1*hours_it + b2*X_it + (fixed effects)
Denote by b1hat_iFE and b1hat_gFE the estimates of b1 under unit FE and group FE.
My goal is to test whether the estimate of b1 (the coefficient on hours_it) is robust to the choice of fixed effects, individual-level or group-level (industry-level in this example), i.e. whether b1hat_iFE = b1hat_gFE.
Here's the procedure suggested in the paper (Procedure 3.2 in Section 3.3, "Testing a single coefficient"):
Step 1. Run the unit-FE regression of the outcome on the variable of interest, time dummies, and controls, and obtain the residuals. Repeat with group FE.
Step 2. Run the unit-FE regression of the variable of interest (hours_it in this example) on time dummies and controls, and obtain the residuals. Repeat with group FE.
Step 3. Compute the average of (unit-FE residuals)^2 (from Step 2) across i and t, and the average of (group-FE residuals)^2 (from Step 2) across i and t.
Step 4. Construct q_hat, the difference between the unit-FE model and the group-FE model in {(residuals from Step 2) * (residuals from Step 1) / (the average from Step 3)} (equation 3.16 in the paper).
Step 5. Obtain SE(b1hat_iFE - b1hat_gFE), the standard error of (b1hat_iFE - b1hat_gFE), by regressing q_hat (from Step 4) on the constant 1, probably clustering at the group level or at least at the individual level. The single estimated coefficient will be identically zero.
The paper then says that we can use a t-statistic version of the Hausman test, obtained as (b1hat_iFE - b1hat_gFE) / SE(b1hat_iFE - b1hat_gFE), to choose between individual FE and group FE.
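To make sure I am reading the steps correctly, here is how I would write Steps 3 to 5 in symbols. The notation is my own shorthand rather than the paper's, so it should be checked against equation 3.16. I write \ddot{x}_{it} and \dot{x}_{it} for the Step 2 residuals under unit FE and group FE, \hat{u}_{it} and \tilde{u}_{it} for the corresponding Step 1 residuals, and I assume a balanced panel with N individuals and T periods, working at the observation level (i, t).

```latex
\begin{align*}
\hat{a}_{FE} &= \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\ddot{x}_{it}^{2},
\qquad
\hat{a}_{gFE} = \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\dot{x}_{it}^{2} \\
\hat{q}_{it} &= \frac{\ddot{x}_{it}\,\hat{u}_{it}}{\hat{a}_{FE}}
              - \frac{\dot{x}_{it}\,\tilde{u}_{it}}{\hat{a}_{gFE}} \\
t &= \frac{\hat{b}_{1,FE} - \hat{b}_{1,gFE}}
          {\mathrm{SE}\bigl(\hat{b}_{1,FE} - \hat{b}_{1,gFE}\bigr)}
\end{align*}
```

Within each model the Step 1 residuals are orthogonal to the Step 2 residuals by construction, so q_hat averages to exactly zero. That is why the coefficient in Step 5 is identically zero and only its standard error is used.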
Here's the Stata code I used to replicate the procedure.
Code:
use https://www.stata-press.com/data/r18/nlswork, clear
global Y  ln_wage
global T  hours
global G  ind_code   // industry identifier
global Xs age race wks_work

* (0) Make a balanced panel by keeping balanced individuals only
*     (not strictly required, but convenient)
* Keep only observations for which all regression variables are non-missing
egen num_missing = rowmiss(${Y} ${T} ${G} ${Xs})
keep if num_missing==0
* Keep only individuals surveyed in every year of the window
keep if inrange(year,68,73)   // a shorter window gives a larger balanced sample
bys idcode: egen num_surveyed = count(ln_wage)
keep if num_surveyed==6

* (1-1) Individual-FE model with time dummies; get residuals (SE clustered at individual level)
xtset idcode year   // idcode is the unit identifier
xtreg ${Y} ${T} ${Xs} i.year, fe vce(cluster idcode)
scalar b1hat_iFE = _b[${T}]    // individual-FE estimate of b1
* After xtreg, predict takes e (idiosyncratic residual); residual is a regress option
predict uhat_iFE_resid, e

* (1-2) Industry-FE model with time dummies; get residuals (SE clustered at individual level)
reg ${Y} ${T} ${Xs} i.${G} i.year, vce(cluster idcode)
scalar b1hat_gFE = _b[${T}]    // group-FE estimate of b1
predict uhat_gFE_resid, residual

* (2-1) Unit-FE regression of T on time dummies and covariates; get residuals (x_doubledot)
xtreg ${T} ${Xs} i.year, fe vce(cluster idcode)
predict x_doubledot, e

* (2-2) Group-FE regression of T on time dummies and covariates; get residuals (x_singledot)
reg ${T} ${Xs} i.${G} i.year, vce(cluster idcode)
predict x_singledot, residual

* (3) Average of (x_doubledot)^2 (ahat_doubledot) across all i and t
gen x_doubledot_sq = (x_doubledot)^2
egen ahat_doubledot = mean(x_doubledot_sq)
*     Average of (x_singledot)^2 (ahat_singledot) across all i and t
gen x_singledot_sq = (x_singledot)^2
egen ahat_singledot = mean(x_singledot_sq)

* (4) Compute q_hat (equation 3.16)
gen qhat = ((x_doubledot * uhat_iFE_resid) / ahat_doubledot) ///
         - ((x_singledot * uhat_gFE_resid) / ahat_singledot)

* (5) Obtain SE(b1hat_iFE - b1hat_gFE) by regressing qhat on 1 (constant), clustering at individual level
*     noconstant keeps vector_1 from being dropped as collinear with _cons
gen vector_1 = 1
reg qhat vector_1, noconstant vce(cluster idcode)
scalar SE_delta = _se[vector_1]   // SE(b1hat_iFE - b1hat_gFE)

* (6) Compute the t-statistic
scalar t = (b1hat_iFE - b1hat_gFE) / SE_delta
scalar list t   // show the t-statistic
I would like to ask two questions about this procedure.
1. In Step 5, how do I obtain SE(b1hat_iFE - b1hat_gFE) from regressing q_hat on 1? Is it the same as the standard error of the coefficient on 1, as in the code above? I assume it is, based on what the authors wrote in the previous section ("...and the cluster-robust variance-covariance matrix will be V1_hat"), but I want to double-check.
2. Once I have computed the final t-statistic, can I interpret it like a regular t-statistic reported in a regression? For example, if the absolute value of my t-statistic is greater than 1.96, can I reject at the 5% level the null hypothesis that the unit-FE and group-FE estimators are the same, and stick with the unit-FE estimator?
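To be concrete about question 2, this is the decision rule I have in mind, written as a short continuation of the code above. It is only a sketch: it assumes the statistic is approximately standard normal under the null, which is exactly what I would like to confirm, and it relies on the scalar t created in step (6).

```stata
* Two-sided p-value under a standard normal approximation (my assumption)
* scalar(t) forces Stata to read t as the scalar, not as a variable abbreviation
scalar p_normal = 2*normal(-abs(scalar(t)))
display "t = " scalar(t) "   two-sided p (normal approx.) = " p_normal

* Tentative rule: reject equality of the two estimators at the 5% level
if p_normal < 0.05 {
    display "Reject H0: keep the unit-FE estimator"
}
else {
    display "Do not reject H0: group FE may be adequate"
}
```

If a t distribution with (number of clusters - 1) degrees of freedom is more appropriate here, the first line would use ttail() instead of normal().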
Any comments are greatly appreciated.
Thank you.