Dear Sebastian Kripfganz and other wonderful Statalisters,
First, thanks for your invaluable contribution to the community by offering solutions to the challenges faced by users like me! This forum is my go-to place for everything Stata.
That said, I am running into some challenges modelling my dataset with the system GMM estimator. My dataset records Sale for roughly 10k employees over 60 months (unbalanced panel). I'm using the following syntax:
xi: xtabond2 Sale L.Sale L.Dperform Dtarget MGrwth Incentv1 Incentv2 ///
    c.L.Sale#c.Incentv1 c.L.Sale#i.Incentv2 zone* yr*, ///
    gmm(L.Sale L.Dperform Dtarget MGrwth c.L.Sale#c.L.Dperform ///
        c.L.Sale#c.Incentv1 c.L.Sale#i.Incentv2, collapse) ///
    iv(zone* yr*, eq(level)) or twostep rob small
Here, Sale is the dependent variable and L.Sale is included as a regressor. L.Dperform is the lag of the division's performance (% of target achieved), Dtarget is the target for the employee's division for the period, and MGrwth is the market growth of the employee's division (month-over-month). Incentv1 (continuous) and Incentv2 (dummy) are employee incentives, and zone* and yr* are dummies for zone and year, respectively. L.Sale is endogenous for obvious reasons, so I'm treating its interactions with the incentives as endogenous as well. Further, L.Dperform, Dtarget and MGrwth are also endogenous or predetermined. I use robust SEs because my employees are nested within divisions. However, when I estimate this model, I encounter the following challenges:
1. First, I get a warning message: "Warning: Two-step estimated covariance matrix of moments is singular. Using a generalized inverse to calculate optimal weighting matrix for two-step estimation." Is this something to be worried about?
2. Next, although I use the collapse option to restrict the number of instruments, the model still uses a large number of instruments (over 400), though this is less than the number of groups (over 5000) in my dataset. My assumption is that as long as the number of instruments is lower than the number of groups, it is acceptable. However, is 400+ still too large?
3. My understanding is that I should include all variables that are either endogenous or predetermined in the gmmstyle option (and therefore all interactions with these variables). Is that correct? If yes, do I include these endogenous/predetermined variables in the gmm option just as they are specified among the regressors (i.e. the command creates the lagged differences by itself)? Or should I include only the lags of these variables, i.e., if Dtarget is included as a regressor, should I include Dtarget or L.Dtarget in the gmm option? Also, do I include all exogenous regressors in the ivstyle option or only the time-specific dummies?
4. The AR(2) test for my model is not significant, suggesting no second-order serial correlation in the first-differenced errors. However, both the Sargan test and the Hansen test are highly significant. Does this suggest that my instruments are not valid?
------------------------------------------------------------------------------
Arellano-Bond test for AR(1) in first differences: z = -8.59 Pr > z = 0.000
Arellano-Bond test for AR(2) in first differences: z = 1.27 Pr > z = 0.204
------------------------------------------------------------------------------
Sargan test of overid. restrictions: chi2(422) =8485.23 Prob > chi2 = 0.000
(Not robust, but not weakened by many instruments.)
Hansen test of overid. restrictions: chi2(422) =1377.92 Prob > chi2 = 0.000
(Robust, but weakened by many instruments.)
If the Sargan test is only appropriate after a difference-GMM estimator, should I base my decision on the Hansen test result? What does a significant Hansen test statistic point toward in my model, and what are some possible ways to resolve this?
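To make point 2 concrete, a restricted-lag variant of the command might look like the sketch below. It is an untested illustration under several assumptions: the panel is already xtset, the interaction terms are generated by hand (the names LSale, LSale_Inc1 and LSale_Inc2 are hypothetical), the lag range 1 to 3 is an arbitrary choice, and the L.Sale x L.Dperform interaction is left out of gmm() because it is not among the regressors.

```stata
* Sketch only: assumes the data are already declared as a panel with xtset.
* Hand-built interaction terms; the new variable names are illustrative.
generate double LSale      = L.Sale
generate double LSale_Inc1 = LSale * Incentv1
generate double LSale_Inc2 = LSale * Incentv2

* Same model, but each gmm() instrument set is limited to lags 1-3
* (an arbitrary range chosen only to show the syntax).
xtabond2 Sale LSale L.Dperform Dtarget MGrwth Incentv1 Incentv2 ///
    LSale_Inc1 LSale_Inc2 zone* yr*, ///
    gmm(LSale L.Dperform Dtarget MGrwth LSale_Inc1 LSale_Inc2, ///
        laglimits(1 3) collapse) ///
    iv(zone* yr*, equation(level)) ///
    orthogonal twostep robust small
```

With collapse and a short lag range, each gmm() variable should contribute only a handful of instruments rather than one per available lag, which is where I suspect most of the 400+ come from.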
Above is a snapshot of my results. I sincerely appreciate your time and contributions in helping me resolve these issues.

Best,
Ash