Hi Statalisters,
I wanted to share with you a program that I've been working on that some of you may find really useful.
Suppose that you want to run the same OLS regression model on a set of mutually exclusive subsets of your dataset, indexed by a variable g, and you want the slopes and intercepts to vary across g. Here are two concrete motivating examples:
1. You want to run the same regression model on all observations in a given year for many different years.
2. You want to run the same regression model on all observations in a given geographic unit (US state, county, ZIP code, commuting zone, Census tract, Census block, etc.) across all such units.
You have a few options for doing this. You can do it manually with a quick loop over each distinct value (or tuple) of your -by- group(s). You can also do it with statsby, but it is pretty slow, and for whatever reason you can't access the full VCV matrix of each estimate from statsby: regress. If there are only a small number of -by- groups and there aren't too many independent variables in your regression model, you can fully interact each regressor with your -by- group, so long as the total number of regressors doesn't exceed Stata's limit of just under 11,000. Or you can use this new program, regressby, which can be hundreds of times faster than any of these options. In my research team's usage, it has sped up some of our estimation scripts by a factor of about 300, reducing the runtime of Stata scripts that would otherwise have taken weeks on a server to under an hour. The performance gains relative to statsby are largest when you have many -by- groups.
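For reference, the manual-loop option might look something like the sketch below. It is only an illustration on a small simulated dataset: the result variable names (b_x, se_x, b_cons, se_cons, cov_x_cons) and the choice of postfile to collect results are my own assumptions, not anything regressby does internally. The last part shows the fully interacted alternative, which reproduces the same point estimates but computes standard errors from a single pooled residual variance.

```stata
* Illustrative sketch only: small simulated dataset with 10 groups
clear all
set seed 123
set obs 10000
gen g = ceil(runiform()*10)
gen x = runiform()
gen y = g + g*x + rnormal()

* Option 1: loop over groups, posting one row of estimates per group
* (result variable names below are hypothetical)
tempname handle
tempfile results
postfile `handle' g b_x se_x b_cons se_cons cov_x_cons using `results'

levelsof g, local(groups)
foreach k of local groups {
    quietly regress y x if g == `k'
    * e(V) holds the full VCV matrix, so off-diagonal terms are available
    matrix V = e(V)
    post `handle' (`k') (_b[x]) (_se[x]) (_b[_cons]) (_se[_cons]) (V[1,2])
}
postclose `handle'

* Option 2: one fully interacted model with group-specific
* intercepts (ibn.g) and slopes (ibn.g#c.x)
quietly regress y ibn.g ibn.g#c.x, noconstant
display "Slope for g == 3 from the interacted model: " _b[3.g#c.x]

* Inspect the per-group results from the loop
use `results', clear
list, clean noobs
```

The loop runs one regression per group, which is why it gets slow as the number of -by- groups grows.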
You can read more about regressby on GitHub here: https://github.com/mdroste/stata-regressby
Here is a minimal working example comparing the usage of statsby vs. regressby:
Code:
* Set up
clear all
set obs 100000
set seed 123
set rmsg on

* Generate a dataset
gen g = ceil(runiform()*1000)
gen x = runiform()
gen y = g + g*x + rnormal()
tempfile t1
save `t1'

* Test with statsby
use `t1', clear
statsby _b _se, clear by(g): regress y x

* Test with regressby
use `t1', clear
regressby y x, by(g)
This program supports the usual OLS asymptotic standard errors, heteroskedasticity-robust (White) standard errors, cluster-robust standard errors, and analytic weights. Support for frequency weights, absorbed fixed effects, and additional diagnostic output (R2, RMSE, etc.) is coming very soon, maybe by the time you read this.
I am especially eager to hear your feedback on features you would like and any bugs that you might encounter. I hope you find this useful - thanks so much!
Best,
Mike