Hello Stata experts,
at the moment I'm working on a project that requires the use of 2SLS method with fixed-effects included. I'm struggling to make sense of the differences in the estimation results produced by Stata commands: ivregress, reghdfe, and ivreghdfe, and then to make a decision on which one should be used.
More specifically, I regress a production function with Y (measures firm performance) as the explained variable and regressors include production factors (x1 - x4) and (endogenous) city factors cityX5 and cityX6 (says, log of city population and immigration rate, they are my variables of interest). I employ a cross-sectional firm-level data set of a country.
To deal with the endogeneity of city factors, I use city-level variables z1 z2 to instrument for X5, and z3 and z4 to instrument for X6. I add the dummies of sate_industry to control region_industry fixed effects (each state has many cities).
Results produced by ivregress (with some unimportant results are dropped to keep the space):
Results produced by ivreghdfe:
Results produced by my manual calculation with reghdfe, I run first-stage and second-stage regressions then correct standard errors of estimated coefficients in the second-stage regression with bootstrap:
As you can see, coefficients of cityX5n and cityX6n are similar across three commands, but SEs are different.
The SEs produced by ivregress are quite close but a bit higher than those with manual calculation (esp cityX6n, SE is .3638874 with manual calculation, but only 0.2336223 through reghdfe, while SEs produced by ivreghdfe seem to be very low, which seems to be suspicious, it seems that they have problem with cluster option.
My questions are:
1. Do you know why many manual calculation does not give strictly similar results to what they are with ivregress?
2. Do you know what is the problem with ivreghdfe in this case? If you share my suspicion of the option cluster, do you suggest any solution?
The point is that, I would like to replace state_industry_pairs with dummies of state and industries at the finer level (more digit in the industry classification) to improve the accuracy of estimation, so there will be more pairs to estimate. In that case, ivregress could not handle the regression due to too many dummies (more than 3,000). The manual calculation can handle any types of dummies but it tends to produce higher SEs which can influence my final verdict on the role of cityX5 and cityX6, in addition, the bootstrap command is running quite slowly.... Finally, ivreghdfe is quickest, produces automatically important identification test, handle dummies very well, but gives values of SEs that are too good (in terms of expectation from theory) to be true. Hence, I really need to understand what is going on the differences in SEs and how to deal with it, because they should be all the same.
I look forward to receiving answers or advice from you to make sense out of this problem. Thanks in advance.
Best regards,
Cuong
at the moment I'm working on a project that requires the use of 2SLS method with fixed-effects included. I'm struggling to make sense of the differences in the estimation results produced by Stata commands: ivregress, reghdfe, and ivreghdfe, and then to make a decision on which one should be used.
More specifically, I regress a production function with Y (measures firm performance) as the explained variable and regressors include production factors (x1 - x4) and (endogenous) city factors cityX5 and cityX6 (says, log of city population and immigration rate, they are my variables of interest). I employ a cross-sectional firm-level data set of a country.
To deal with the endogeneity of city factors, I use city-level variables z1 z2 to instrument for X5, and z3 and z4 to instrument for X6. I add the dummies of sate_industry to control region_industry fixed effects (each state has many cities).
Results produced by ivregress (with some unimportant results are dropped to keep the space):
HTML Code:
. ivregress 2sls Y x1 x2 x3 x4 (cityX5 cityX6= z1 z2 z3 z4) i.state_industry_pairs if sample==1, vce(cluster state_id) note: 43.state_industry_pairs identifies no observations in the sample note: 44.state_industry_pairs identifies no observations in the sample note: 48.state_industry_pairs identifies no observations in the sample note: 60.state_industry_pairs identifies no observations in the sample ..... Instrumental variables (2SLS) regression Number of obs = 164,343 Wald chi2(1031) = 181743.21 Prob > chi2 = 0.0000 R-squared = 0.3889 Root MSE = 1.1872 (Std. Err. adjusted for 63 clusters in state_id) -------------------------------------------------------------------------------------- | Robust Y | Coef. Std. Err. z P>|z| [95% Conf. Interval] ---------------------+---------------------------------------------------------------- cityX5 | .0329823 .0152646 2.16 0.031 .0030642 .0629004 cityX6 | .1052269 .2336223 0.45 0.652 -.3526644 .5631183 x1 | .2926192 .0102266 28.61 0.000 .2725754 .312663 x2 | .0316666 .0350465 0.90 0.366 -.0370233 .1003565 x3 | .0694911 .0182037 3.82 0.000 .0338126 .1051696 x4 | .6947005 .0578016 12.02 0.000 .5814115 .8079895 | state_industry_pairs | 2 | 1.686815 .0173291 97.34 0.000 1.65285 1.720779 3 | 1.727269 .0193602 89.22 0.000 1.689324 1.765214 4 | .918509 .0260839 35.21 0.000 .8673854 .9696326 5 | 1.665119 .0484384 34.38 0.000 1.570181 1.760056 6 | 2.387751 .0417277 57.22 0.000 2.305967 2.469536 7 | 1.261545 .0535408 23.56 0.000 1.156607 1.366484 8 | .7721015 .0317133 24.35 0.000 .7099446 .8342585 9 | 1.387479 .062664 22.14 0.000 1.26466 1.510298 .... | _cons | 2.09025 .2286741 9.14 0.000 1.642057 2.538443 -------------------------------------------------------------------------------------- Instrumented: cityX5 cityX6 Instruments: x1 x2 x3 x4 2.state_industry_pairs 3.state_industry_pairs 4.state_industry_pairs 5.state_industry_pairs 6.state_industry_pairs 7.state_industry_pairs 8.state_industry_pairs 9.state_industry_pairs 10.state_industry_pairs 11.state_industry_pairs 12.state_industry_pairs 13.state_industry_pairs 14.state_industry_pairs 15.state_industry_pairs 16.state_industry_pairs 17.state_industry_pairs 18.state_industry_pairs 19.state_industry_pairs 20.state_industry_pairs 21.state_industry_pairs ...
HTML Code:
. ivreghdfe Y x1 x2 x3 x4 (cityX5 cityX6= z1 z2 z3 z4) if sample==1, absorb(state_industry_pairs) vce(cluster state_id) (dropped 55 singleton observations) (MWFE estimator converged in 1 iterations) IV (2SLS) estimation -------------------- Estimates efficient for homoskedasticity only Statistics consistent for homoskedasticity only Number of obs = 164288 F( 6,164281) = 4595.61 Prob > F = 0.0000 Total (centered) SS = 270763.1577 Centered R2 = 0.1445 Total (uncentered) SS = 270763.1577 Uncentered R2 = 0.1445 Residual SS = 231627.2165 Root MSE = 1.187 ------------------------------------------------------------------------------ Y | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- cityX5 | .0329823 .0067151 4.91 0.000 .0198209 .0461438 cityX6 | .1052269 .072774 1.45 0.148 -.0374086 .2478625 x1 | .2926192 .0020022 146.15 0.000 .2886949 .2965435 x2 | .0316666 .0015753 20.10 0.000 .0285791 .0347541 x3 | .0694911 .0070562 9.85 0.000 .0556611 .0833211 x4 | .6947005 .013997 49.63 0.000 .6672667 .7221343 ------------------------------------------------------------------------------ Underidentification test (Anderson canon. corr. LM statistic): 4.3e+04 Chi-sq(3) P-val = 0.0000 ------------------------------------------------------------------------------ Weak identification test (Cragg-Donald Wald F statistic): 1.4e+04 Stock-Yogo weak ID test critical values: 5% maximal IV relative bias 11.04 10% maximal IV relative bias 7.56 20% maximal IV relative bias 5.57 30% maximal IV relative bias 4.73 10% maximal IV size 16.87 15% maximal IV size 9.93 20% maximal IV size 7.54 25% maximal IV size 6.28 Source: Stock-Yogo (2005). Reproduced by permission. ------------------------------------------------------------------------------ Sargan statistic (overidentification test of all instruments): 0.064 Chi-sq(2) P-val = 0.9687 ------------------------------------------------------------------------------ Instrumented: cityX5 cityX6 Included instruments: x1 x2 x3 x4 Excluded instruments: z1 z2 z3 z4 Partialled-out: _cons nb: total SS, model F and R2s are after partialling-out; any small-sample adjustments include partialled-out variables in regressor count K ------------------------------------------------------------------------------ Absorbed degrees of freedom: --------------------------------------------------------------+ Absorbed FE | Categories - Redundant = Num. Coefs | ----------------------+---------------------------------------| state_industry_pairs | 971 971 0 *| --------------------------------------------------------------+ * = FE nested within cluster; treated as redundant for DoF computation
HTML Code:
program tsls_test 1. quietly: reghdfe cityX5 z1 z2 z3 z4 x1 x2 x3 x4 if sample==1, absorb(state_industry_pairs) vce(cluster state_id) 2. predict cityX5n, xb 3. quietly: reghdfe cityX6 z1 z2 z3 z4 x1 x2 x3 x4 if sample==1, absorb(state_industry_pairs) vce(cluster state_id) 4. predict cityX6n, xb 5. quietly: reghdfe y x1 x2 x3 x4 cityX5n cityX6n if sample==1, absorb(state_industry_pairs) vce(cluster state_id) 6. drop cityX5n cityX6n 7. end bootstrap, cluster(state_id) reps(250): tsls_test (running tsls_test on estimation sample) Bootstrap replications (250) ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 .................................................. 50 .................................................. 100 .................................................. 150 .................................................. 200 .................................................. 250 HDFE Linear regression Number of obs = 164,288 Absorbing 1 HDFE group Replications = 250 Wald chi2(6) = 4202.52 Prob > chi2 = 0.0000 R-squared = 0.3821 Adj R-squared = 0.3784 Root MSE = 1.1916 (Replications based on 63 clusters in state_id) ------------------------------------------------------------------------------ | Observed Bootstrap Normal-based y | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- x1 | .2926192 .0104536 27.99 0.000 .2721305 .3131079 x2 | .0316666 .036746 0.86 0.389 -.0403543 .1036874 x3 | .0694911 .0176274 3.94 0.000 .034942 .1040402 x4 | .6947005 .0529963 13.11 0.000 .5908297 .7985713 cityX5n | .0329823 .015012 2.20 0.028 .0035593 .0624054 cityX6n | .105227 .3638874 0.29 0.772 -.6079792 .8184331 _cons | 4.080161 .2575816 15.84 0.000 3.575311 4.585012 ------------------------------------------------------------------------------ Absorbed degrees of freedom: --------------------------------------------------------------+ Absorbed FE | Categories - Redundant = Num. Coefs | ----------------------+---------------------------------------| state_industry_pairs | 971 971 0 *| --------------------------------------------------------------+ * = FE nested within cluster; treated as redundant for DoF computation
The SEs produced by ivregress are quite close but a bit higher than those with manual calculation (esp cityX6n, SE is .3638874 with manual calculation, but only 0.2336223 through reghdfe, while SEs produced by ivreghdfe seem to be very low, which seems to be suspicious, it seems that they have problem with cluster option.
My questions are:
1. Do you know why many manual calculation does not give strictly similar results to what they are with ivregress?
2. Do you know what is the problem with ivreghdfe in this case? If you share my suspicion of the option cluster, do you suggest any solution?
The point is that, I would like to replace state_industry_pairs with dummies of state and industries at the finer level (more digit in the industry classification) to improve the accuracy of estimation, so there will be more pairs to estimate. In that case, ivregress could not handle the regression due to too many dummies (more than 3,000). The manual calculation can handle any types of dummies but it tends to produce higher SEs which can influence my final verdict on the role of cityX5 and cityX6, in addition, the bootstrap command is running quite slowly.... Finally, ivreghdfe is quickest, produces automatically important identification test, handle dummies very well, but gives values of SEs that are too good (in terms of expectation from theory) to be true. Hence, I really need to understand what is going on the differences in SEs and how to deal with it, because they should be all the same.
I look forward to receiving answers or advice from you to make sense out of this problem. Thanks in advance.
Best regards,
Cuong
Comment