statsby and storing predicted values

RJ seaman

Join Date: Jul 2018

Posts: 4
#1


20 Jul 2018, 05:38

I have done some searching for a solution to this problem but I cannot find the exact answer.

I have used statsby and saved the coefficients of a linear regression as a new data set.

I want to know if the predict command can be used with statsby to also save predicted values of the linear regression?

My alternative idea is to use the data set with the saved coefficients to manually code the regression and get predicted values - but this seems a little messy especially as I am running the regression for 888000 groups.

Thank you
Jorrit Gosens

Join Date: Jan 2015

Posts: 1019
#2

20 Jul 2018, 06:20

I don't believe statsby would provide predictions, but maybe someone will correct me.

Your question is somewhat unclear. predict works on individual observations, statsby on groups. So you could implement your alternative idea by collapsing the original data on groups and multiplying coefficients by variable values. The code to do so would be exactly as elaborate for 2 groups as for 800,000 groups, so I don't see an issue there.
I do wonder what your purpose with 880,000 groups is. Maybe some more explanation on the point of your exercise would help give better suggestions.
Nick Cox

Join Date: Mar 2014

Posts: 35698
#3

20 Jul 2018, 06:23

Short answer is No. statsby saves one observation for each by group. But you want a dataset with the same number of observations as the original values. But what you could do is merge the coefficients dataset back with the original data and run the calculation.
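The merge route could look something like this. It is only a minimal sketch, assuming the Grunfeld data and a regression of invest on mvalue within each company; the tempfile name coefs is an arbitrary choice for illustration.

```stata
webuse grunfeld, clear
tempfile coefs

* one regression per company; coefficients go to a separate dataset,
* while the data in memory are left unchanged
statsby _b, by(company) saving(`coefs'): regress invest mvalue

* attach each company's coefficients to its original observations
merge m:1 company using `coefs', nogenerate

* statsby stores the coefficients as _b_mvalue and _b_cons
generate predicted = _b_cons + _b_mvalue * mvalue
```

The result has the same number of observations as the original data, with the coefficients repeated within each company.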

Alternatively, you could use a command like rangestat (SSC) which isn't a reduction command. Here is a dopey example.

Code:

. webuse grunfeld, clear
. rangestat (reg) invest mvalue, int(year 0 0)
. gen predicted = b_cons + b_mvalue * mvalue

rangestat ran a series of regressions and left the coefficients in the dataset as new variables. Then it's just a matter of doing all the calculations for all the regressions: but, white magic and see how educating some at Hogwarts helps the community, it's just one command.

There are other commands in this territory with loosely similar goals, but I am most familiar with rangestat. You need to download it first with

Code:

ssc install rangestat

and then read the help.
RJ seaman

Join Date: Jul 2018

Posts: 4
#4

20 Jul 2018, 06:28

Thanks for clarifying my first thoughts - that statsby is not appropriate for storing predictions.

The point of the exercise is to have a linear regression for each age (0-110+), sex and year (4 points in time) group to identify the extent to which place of residence (5 places) affects a health outcome. I now want the predicted values for the 5 places of residence.

I think the solution is to get the predicted results for the individual observations and, if need be, to merge with the coefficient data set.

Thanks for your quick response.
Nick Cox

Join Date: Mar 2014

Posts: 35698
#5

20 Jul 2018, 06:45

"I think the solution is to get the predicted results for the individual observations and, if need be, to merge with the coefficient data set."

No; that's a misunderstanding of the advice.

You have a choice. (Dopey) You can use statsby, then merge with the original and then do the calculation. I really don't recommend that, because (smart) you can do it all in place without any merge at all.

The example in #3 is deliberately one you can imitate yourself so that you can see what happens. Edit the data before and after to understand what is going on.
RJ seaman

Join Date: Jul 2018

Posts: 4
#6

20 Jul 2018, 08:17

Thanks Nick - this seems to be the exact solution.

I am now struggling to put rangestat into a loop or use it with bysort.

For example with the 'webuse grunfeld' example data set how would I make the regression run within each company?
Nick Cox

Join Date: Mar 2014

Posts: 35698
#7

20 Jul 2018, 08:27

Code:

rangestat (reg) invest mvalue, int(company 0 0)

The 0 0 are offsets. The syntax in this case is that the interval runs from company+0 to company+0. Here it's a long-winded way to say for each distinct company.

Otherwise put, interval() specifies the interval that defines a subset or subseries for regression. If you want the interval to be a single integer, that's fine. The program doesn't mind or care. The only rule is that the interval is numeric. So string identifiers need to be mapped to distinct integers.
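A minimal sketch of that mapping, assuming a string identifier. The variable firm is hypothetical and is created here only so that the example is self-contained; firm_id is likewise an arbitrary name.

```stata
webuse grunfeld, clear

* hypothetical string identifier, made up for illustration only
generate firm = "firm " + string(company)

* map each distinct string to a distinct integer 1, 2, 3, ...
egen long firm_id = group(firm)

* one regression per distinct value of firm_id
rangestat (reg) invest mvalue, interval(firm_id 0 0)
generate predicted = b_cons + b_mvalue * mvalue
```

The same egen group() step also works with several variables at once, so it is one way to turn a combination of grouping variables into a single numeric identifier.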

The leading use case for rangestat is statistics within ranges, possibly overlapping, for time series, but the syntax allows several other applications too.
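As a sketch of that leading use case, assuming the same Grunfeld data: nonzero offsets give overlapping windows, here the current year and the four before it, within each company. The window length of 5 is an arbitrary choice for illustration.

```stata
webuse grunfeld, clear

* for each observation, regress over years year-4 to year, separately by company
rangestat (reg) invest mvalue, interval(year -4 0) by(company)

* use the coefficients only where the full 5-year window was available
generate predicted = b_cons + b_mvalue * mvalue if reg_nobs == 5
```

Each observation gets the coefficients of its own window, so the early years of each company, where fewer than five observations are available, are left missing here.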
RJ seaman

Join Date: Jul 2018

Posts: 4
#8

20 Jul 2018, 08:45

Thanks so much Nick this solves the problem perfectly!