Prediction errors by groups

Mikkel Zeuthen

Join Date: Jan 2016

Posts: 21
#1

28 Nov 2017, 07:16

Dear all

Does any of you have an idea of how to calculate prediction errors or uncertainty by groups of observations?

Say I have a dataset of individual observations in a country and I predict average grades (gpa) by the following linear regression:

reg gpa age i.gender i.fathereducation
predict yhat

So by tabstat yhat, by(region) I get the predicted grades by region, but how do I calculate the s.e. or confidence/prediction interval by region?

Thanks a lot
Clyde Schechter

Join Date: Apr 2014

Posts: 29959
#2

28 Nov 2017, 08:23

-tabstat- has a -statistics()- option that lets you specify which statistics you want calculated for each group. mean is one of them (and it is what you get by default if you don't specify anything), but there is also semean for the standard error of the mean. See -help tabstat- for details.

Another approach altogether would be to use the -ci- command for this. See -help ci-.
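
A minimal sketch of both suggestions, for readers who want to try them. It assumes the auto dataset as a stand-in for the poster's data, with foreign playing the role of region; the variable name yhat follows the original post.

Code:

sysuse auto, clear
regress price mpg weight
predict yhat, xb

* mean, standard error of the mean, and count of yhat within each group
tabstat yhat, by(foreign) statistics(mean semean n)

* group-wise confidence intervals for the mean of yhat
* (Stata 14.1 or later; older versions use -ci yhat-)
bysort foreign: ci means yhat

Both commands treat yhat as if it were observed data, so the standard errors describe only how much the predictions vary within each group.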

Added: It dawns on me that this may not be what you want. Those approaches represent the variation in the prediction among the observations in each group, but they do not account at all for uncertainty in the regression coefficients themselves. So maybe what you really want would be the output of -margins, over(region)- after your regression.
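
As an illustration of the -margins- route, here is a sketch that assumes the auto dataset in place of the poster's data. foreign stands in for region and, like region in the original model, is deliberately left out of the regression.

Code:

sysuse auto, clear
regress price mpg weight      // foreign (the grouping variable) is not a regressor
margins, over(foreign)        // average linear prediction per group, with delta-method SEs

The standard errors reported here come from the estimated coefficient covariance matrix, so they reflect uncertainty in the coefficients rather than the spread of the predictions within a group.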

Last edited by Clyde Schechter; 28 Nov 2017, 09:02.
Mikkel Zeuthen

Join Date: Jan 2016

Posts: 21
#3

28 Nov 2017, 09:14

Hi Clyde.

Thanks for your reply - yeah I guess the problem is that your first solution doesn't take the differences between predicted and actual values into account (residuals).

I'll look into your added suggestion. Is -margins- a postestimation command as well?
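
On the point about residuals: after -regress-, -predict- offers two different standard errors, and only one of them includes the residual variance. The sketch below assumes the auto dataset; the variable names se_mean, se_fcast and check are made up for illustration.

Code:

sysuse auto, clear
regress price mpg weight
predict yhat, xb
predict se_mean, stdp       // SE of the fitted mean at each car's covariates
predict se_fcast, stdf      // SE of a forecast for a single car (adds residual variance)

* for -regress-, stdf^2 = stdp^2 + (root MSE)^2
generate double check = sqrt(se_mean^2 + e(rmse)^2)
list yhat se_mean se_fcast check in 1/5

stdp is the quantity behind a confidence interval for the fitted mean, while stdf is the one behind a prediction interval for an individual observation.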
Clyde Schechter

Join Date: Apr 2014

Posts: 29959
#4

28 Nov 2017, 09:34

Yes, -margins- is a postestimation command.
Mikkel Zeuthen

Join Date: Jan 2016

Posts: 21
#5

29 Nov 2017, 01:23

So I've tried -margins, over(region)-, but it's not the same as tabstat yhat, by(region). The results are different...?

Moreover, I'm a little confused about the s.e. from the margins postestimation: how are these s.e. different from the s.e. I get from my regress?
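
The thread does not resolve the mismatch, so the following is only one possible explanation, offered as an assumption: -predict- fills in yhat for every observation with non-missing covariates, whereas -margins- by default averages over the estimation sample only. The sketch below uses the auto dataset and forces such a difference by dropping cars with missing rep78 from the estimation.

Code:

sysuse auto, clear
regress price mpg weight if !missing(rep78)   // 69 of the 74 cars are used
predict yhat, xb                              // but yhat is filled in for all 74

margins, over(foreign)                        // estimation sample only
tabstat yhat, by(foreign)                     // every car that has a prediction
tabstat yhat if e(sample), by(foreign)        // restricted to the estimation sample

If this is the cause, the last -tabstat- should reproduce the -margins- point estimates. As for the standard errors, -regress- reports one per coefficient, while -margins- reports the standard error of each group's average prediction, which combines the whole coefficient covariance matrix with that group's covariate values.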

Carlo Lazzaro

Join Date: Apr 2014

Posts: 17676
#6

29 Nov 2017, 02:23

Mikkel:
another approach that springs to my mind is:

Code:

. sysuse auto.dta
(1978 Automobile Data)

. regress price mpg i.foreign

      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(2, 71)        =     14.07
       Model |   180261702         2  90130850.8   Prob > F        =    0.0000
    Residual |   454803695        71  6405685.84   R-squared       =    0.2838
-------------+----------------------------------   Adj R-squared   =    0.2637
       Total |   635065396        73  8699525.97   Root MSE        =    2530.9

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -294.1955   55.69172    -5.28   0.000    -405.2417   -183.1494
             |
     foreign |
    Foreign  |   1767.292    700.158     2.52   0.014     371.2169    3163.368
       _cons |   11905.42   1158.634    10.28   0.000     9595.164    14215.67
------------------------------------------------------------------------------

. predict predict, xb

. predict se, stdp

. bysort foreign: list predict se if _n<=10

---------------------------------------------------------------------------------------------------------
-> foreign = Domestic

     +---------------------+
     |  predict         se |
     |---------------------|
  1. | 5433.114   371.2582 |
  2. | 6904.091   384.6718 |
  3. | 5433.114   371.2582 |
  4. | 6021.504   351.1114 |
  5. | 7492.482   442.0976 |
     |---------------------|
  6. | 6609.896   365.4288 |
  7. | 4256.332   491.3017 |
  8. | 6021.504   351.1114 |
  9. | 7198.287   410.6212 |
 10. |   6315.7   353.9875 |
     +---------------------+

---------------------------------------------------------------------------------------------------------
-> foreign = Foreign

     +---------------------+
     |  predict         se |
     |---------------------|
  1. | 8671.384   691.7728 |
  2. |  6906.21   548.5566 |
  3. | 6317.819   539.7479 |
  4. |  6906.21   548.5566 |
  5. | 3375.864   784.5907 |
     |---------------------|
  6. | 6612.015   541.3127 |
  7. | 7494.602   579.0627 |
  8. | 7494.602   579.0627 |
  9. | 6317.819   539.7479 |
 10. | 5435.232   568.7454 |
     +---------------------+


.

Kind regards,
Carlo
(Stata 19.0)


Mikkel Zeuthen

Join Date: Jan 2016

Posts: 21
#7

29 Nov 2017, 03:13

Hi Carlo. Thanks for your contribution.

I guess your suggestion doesn't report the s.e. on a group level (foreign)? or am I missing something here?

Carlo Lazzaro

Join Date: Apr 2014

Posts: 17676
#8

29 Nov 2017, 03:42

Mikkel:
yes, you're correct.
You may want to try:

Code:

forval i = 0/1 {
    mean predict se if foreign==`i'
}

Mean estimation                   Number of obs   =         52

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
     predict |   6072.423    193.515      5683.925    6460.921
          se |   426.4441   13.84191      398.6553    454.2329
--------------------------------------------------------------

Mean estimation                   Number of obs   =         22

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
     predict |   6384.682   414.6715      5522.325    7247.038
          se |   636.3742     27.252      579.7006    693.0479
--------------------------------------------------------------

However, as you already noted, the SEs differ from the ones estimated via -margins-.
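
To see the difference side by side, here is a sketch that runs -margins- on the same model (auto dataset assumed, as above). In the -mean- output, the Std. Err. of predict measures how much the fitted values vary between cars, and the mean of se averages the pointwise standard errors; neither is the standard error of the group's average prediction, which is what -margins- reports.

Code:

sysuse auto, clear
regress price mpg i.foreign
margins, over(foreign)                       // delta-method SEs, covariates treated as fixed

* optionally let the SEs also reflect sampling variation in the covariates;
* this requires a robust (or cluster) VCE at estimation time
regress price mpg i.foreign, vce(robust)
margins, over(foreign) vce(unconditional)

Which of the two -margins- variants is appropriate depends on whether the covariate values in each group are regarded as fixed or as a sample from a larger population.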

Kind regards,
Carlo
(Stata 19.0)


Mikkel Zeuthen

Join Date: Jan 2016

Posts: 21
#9

29 Nov 2017, 04:20

Thanks again. Yes, it differs from -margins-, plus I'm not sure the averaged s.e. per group is a good measure of the group's prediction error. It's not easy!
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17676
#10

29 Nov 2017, 04:47

Mikkel:
yes, I share your concerns.
And that's why I wouldn't have followed your research strategy.
I would have added an -i.region- predictor in the right-hand side of the regression equation, instead.
However, I'm sure you have good methodological reasons (or constraints) to act differently.
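
A sketch of that alternative, again assuming the auto dataset with foreign standing in for region:

Code:

sysuse auto, clear
regress price mpg weight i.foreign    // the grouping variable enters the model
margins foreign                       // predictive margins for each group

Unlike -margins, over(foreign)-, this averages over all cars in the estimation sample with foreign set to each level in turn, so the groups are compared at the same distribution of mpg and weight.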

Kind regards,
Carlo
(Stata 19.0)
Mikkel Zeuthen

Join Date: Jan 2016

Posts: 21
#11

29 Nov 2017, 05:16

That wouldn't do, because I'm not looking for an "everything else being equal" interpretation; it's for a benchmark analysis.