Negative binomial alternative to Sergio Correia’s ppmlhdfe command in Stata

Kasey Zapatka

Join Date: Feb 2019

Posts: 12
#1

16 Mar 2021, 18:11

Hello everyone,

I am working on a paper with a colleague using individual-level microdata from the US Census (2015-2019 ACS 5-year estimates). We are predicting individuals’ wages while controlling for a variety of covariates. One of the covariates is the state PUMA in which an individual resides. This becomes an issue because there are over 2,000 different state PUMAs across the country. Since we are only interested in controlling for state PUMA, we could absorb it using the areg command, which runs relatively quickly. However, since we are specifically interested in reporting accurate estimates of individuals’ wages to support our argument, we can’t rely on modeling log wages and taking the anti-log of the predictions. Instead, to obtain accurate estimates we should use Poisson regression. But since there is dramatic over-dispersion of wages, we actually need to use negative binomial regression. We run into problems because the dataset is very large (~4 million observations) and the state PUMA variable has a large number of categories. Sometimes the model does not converge.
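
The retransformation concern mentioned above can be sketched in standard notation. The symbols y_i, x_i, β, and u_i are illustrative assumptions, not taken from any model in this thread. If log wages are modeled linearly with an error u_i, the anti-log of the fitted value recovers the conditional mean of wages only when the error term contributes nothing, which by Jensen's inequality it generally does:

```latex
\log y_i = x_i \beta + u_i
\;\Longrightarrow\;
\mathrm{E}[y_i \mid x_i] = \exp(x_i \beta)\,\mathrm{E}[\exp(u_i) \mid x_i]
\;\neq\; \exp(x_i \beta) \quad \text{in general.}
```

Modeling E[y_i | x_i] = exp(x_i β) directly, as Poisson regression does, sidesteps this step.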

My colleague and I are wondering whether there is a negative binomial counterpart to Sergio Correia’s ppmlhdfe command in Stata. This command uses a pseudo-likelihood procedure instead of a maximum likelihood one, which dramatically speeds up the analysis. As I said, we don’t think we can use it for our analysis, because our dependent variable is over-dispersed and therefore requires an additional parameter to adequately model the over-dispersion.

Any advice on this would be greatly appreciated.

Best,
Kasey
Tags: fixed effects, negative binomial, ppmlhdfe
Nick Cox

Join Date: Mar 2014

Posts: 35115
#2

16 Mar 2021, 18:19

Cross-posted at https://stackoverflow.com/questions/...mmand-in-stata
Andrew Musau

Join Date: Oct 2014

Posts: 9911
#3

16 Mar 2021, 18:20

"But since there is dramatic over-dispersion of wages, we actually need to use negative binomial regression."

See Jeff Wooldridge's comments in #3 of https://www.statalist.org/forums/for...-poisson-model on why this assertion is incorrect. In general, you can implement high-dimensional fixed effects in linear models and in Poisson, but not in other nonlinear models.
1 like
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2079
#4

16 Mar 2021, 18:52

You should use Poisson regression in this context, as you already said you want effects on the mean. Negative Binomial is not even a close second. As Andrew points out, you should ignore the canard that says one should not use Poisson regression when there is overdispersion. Poisson regression is completely robust; NegBin is not.
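
One way to see the robustness claim, as a sketch in standard notation (the symbols y_i, x_i, β, and β_0 are illustrative and not from the thread): with an exponential mean, the Poisson quasi-maximum-likelihood estimator solves a first-order condition that involves only the conditional mean. That condition holds in expectation whenever the mean is correctly specified, whatever the variance or distribution of y_i:

```latex
\sum_{i=1}^{N} x_i^{\top} \bigl( y_i - \exp(x_i \beta) \bigr) = 0,
\qquad
\mathrm{E}\!\left[ x_i^{\top} \bigl( y_i - \exp(x_i \beta_0) \bigr) \,\middle|\, x_i \right] = 0
\ \text{ whenever } \ \mathrm{E}[y_i \mid x_i] = \exp(x_i \beta_0).
```

Overdispersion therefore bears on which standard errors to report (robust or clustered rather than the default), not on the consistency of the coefficient estimates.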

If you have lots of individuals per PUMA -- as I believe you must -- then including the PUMA coefficients in a pooled Poisson estimation will work. I would suggest tricking Stata by using xtpoisson after an xtset puma, but then your choice of standard errors is somewhat restricted. You're forced to compute the nonrobust standard errors or cluster at the PUMA level. The first should not be used and the second might not be needed.
1 like
Kasey Zapatka

Join Date: Feb 2019

Posts: 12
#5

17 Mar 2021, 00:01

Thanks everyone for all the great feedback and for pointing out our common but incorrect assumptions about negative binomial.

Jeff Wooldridge, we attempted the xtpoisson trick you suggested (specifying xtset puma and xtpoisson), but the FE specification for xtpoisson does not allow for clustered standard errors. Instead, we can choose oim (the default), robust, bootstrap, or jackknife. We specified a null HLM to check the ICC and found that about 9% of the variance in wages occurred between PUMAs, so we think clustering the SEs is in order. In light of this, should we reconsider using Sergio Correia's ppmlhdfe command since it allows for clustering of standard errors?

Thanks again, much appreciated.
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2079
#6

17 Mar 2021, 11:09

Kasey: It's a quirk of xtpoisson (and a few other Stata commands) that vce(robust) and vce(cluster puma) are the same. The latter is not allowed for some reason. I think the idea is that with FE methods and T not so large, it's all or nothing. With xtreg, vce(robust) and vce(cluster puma) are both allowed and give identical standard errors: robust to serial correlation and heteroskedasticity. So if you want to cluster at the puma level, xtpoisson, fe vce(robust) does it. Jeff
1 like
Joao Santos Silva

Join Date: Apr 2014

Posts: 2960
#7

17 Mar 2021, 11:55

Further to the excellent advice Jeff provided, I would just like to add that overdispersion does not even make sense when the dependent variable is not a count because we can change the relation between the mean and the variance simply by changing the scale. This also implies that the results of the negbin regression (or zero-inflated models) in this context will depend on the units used to measure the dependent variable.
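
The scale argument can be checked numerically. The following is a minimal sketch in Python using simulated lognormal "wages"; the distribution, its parameters, and all variable names are assumptions made for illustration, not the ACS data discussed in this thread. It compares the variance-to-mean ratio of the same values expressed in dollars and in millions of dollars:

```python
import random
import statistics


def variance_to_mean_ratio(values):
    """Sample variance divided by sample mean."""
    return statistics.variance(values) / statistics.mean(values)


random.seed(0)

# Hypothetical wages in dollars (lognormal draws, purely illustrative).
wages_dollars = [random.lognormvariate(10.5, 0.8) for _ in range(10_000)]

# The same wages expressed in millions of dollars.
wages_millions = [w / 1_000_000 for w in wages_dollars]

ratio_dollars = variance_to_mean_ratio(wages_dollars)
ratio_millions = variance_to_mean_ratio(wages_millions)

# Rescaling by c multiplies the variance by c**2 but the mean only by c,
# so the variance-to-mean ratio is itself multiplied by c.
print("variance/mean in dollars: ", ratio_dollars)
print("variance/mean in millions:", ratio_millions)
```

Under these assumptions the ratio is far above 1 in dollars and below 1 in millions, so the same variable looks "overdispersed" or "underdispersed" depending only on the unit of measurement.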
3 likes
Kasey Zapatka

Join Date: Feb 2019

Posts: 12
#8

17 Mar 2021, 12:53

Thanks, Jeff Wooldridge, for the clarification on xtpoisson. That is not an obvious quirk I would have picked up on. Yes, that's a good point, Joao Santos Silva. I'll keep that in mind when I encounter overdispersion in the future. Thanks for the comments, both of you, very helpful.