Using saved model estimates on an updated data-set

Daniel Sutcliffe

Join Date: Sep 2020

Posts: 77
#1

29 Sep 2020, 12:35

Hello,

I have a large dataset of 59,000,000 observations and 100+ variables.

I have run a regression, the point of which is to obtain a predicted cost for each of the 59,000,000 observations (these are people).

I ran the regression and saved the model estimates to a .ster file using estimates save full_model, replace.

After running the regression, and obtaining the predictions, I realised that some observations (around 100,000) had missing predictions. This is because some of the independent variables had missing values. I replaced the missing values. I then used the .ster file to re-run the parameter estimates against the full dataset (this was done as a quick fix to get predictions for all observations, prior to running the model at a later date).

However, when I obtained the predictions, I still had the same number of missing predictions. I understood that I could use saved model estimates against an updated dataset - but despite replacing missing values for certain variables, I am still getting missing predictions for the same observations, even though they no longer have missing values.

I have read the documentation, and I think it has something to do with setting e(sample) - but I am not sure I quite follow.

Is anyone able to explain why I still get missing predictions when using saved estimates despite replacing missing values in the data?
Andrew Musau

Join Date: Oct 2014

Posts: 10069
#2

29 Sep 2020, 23:04

After running the regression, and obtaining the predictions, I realised that some observations (around 100,000) had missing predictions. This is because some of the independent variables had missing values. I replaced the missing values. I then used the .ster file to re-run the parameter estimates against the full dataset (this was done as a quick fix to get predictions for all observations, prior to running the model at a later date).

However, when I obtained the predictions, I still had the same number of missing predictions. I understood that I could use saved model estimates against an updated dataset - but despite replacing missing values for certain variables, I am still getting missing predictions for the same observations, even though they no longer have missing values.

You need to create a minimal working example (MWE) that replicates this. It may be that you are overlooking missing values in other variables. As long as you have filled in all missing values, you should get predictions as in the example below.

Code:

sysuse auto
regress mpg weight i.rep78
predict mpghat, xb
sum mpghat
est save m1
clear all
sysuse auto
replace rep78=2 if missing(rep78)
est use m1
predict mpghat2, xb
sum mpghat2

Res.:

Code:

. sum mpghat

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      mpghat |         69    21.28985    4.807912   10.95315   30.24022

.
. sum mpghat2

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
     mpghat2 |         74    21.33238    4.695795   10.95315   30.24022
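
If predictions are still missing after loading the saved estimates, one way to check for overlooked missing values is to look only at the observations where the prediction is missing. The following is a minimal sketch on the auto data, not a diagnosis of the original poster's dataset. It assumes the model is the same regress mpg weight i.rep78 as in this example, it deliberately leaves the missing values of rep78 in place so that there is something to find, and the variable name nmiss is made up for illustration.

```stata
* Fit the model and save the estimates, as in the example in this post
sysuse auto, clear
regress mpg weight i.rep78
estimates save m1, replace

* Reload the data WITHOUT filling in rep78, then predict from the saved estimates
sysuse auto, clear
estimates use m1
predict mpghat2, xb

* How many predictions are missing?
count if missing(mpghat2)

* Among those observations, which regressors still have missing values?
misstable summarize weight rep78 if missing(mpghat2)

* Per-observation count of missing regressors (nmiss is a hypothetical name)
egen nmiss = rowmiss(weight rep78)
list make weight rep78 if nmiss > 0
```

On a real dataset, the variable list passed to misstable summarize and rowmiss() would be every variable that appears on the right-hand side of the model.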

William Lisowski

Join Date: Dec 2014
Posts: 10150

30 Sep 2020, 07:59

I then used the .ster file to re-run the parameter estimates against the full dataset

The meaning of this is unclear to me. As Andrew indicates, you very much need to provide example code, preferably a reproducible example such as his.

In particular, if you did something like the following modification to Andrew's example

Code:

replace rep78=2 if missing(rep78)
est use m1
regress
predict mpghat2, xb

then you should understand that your second regression was not "re-run" on the full dataset. All that happened was that the previous estimates were reprinted unchanged. That does not directly explain why your predictions again had missing values, since Andrew's example shows that predictions are produced once the missing values are filled in. It is possible, for example, that you replaced missing values in a categorical variable with a new category, for which no coefficient has been estimated.

Here's a demonstration of the second regression not actually needing any data in memory.

Code:

. sysuse auto
(1978 Automobile Data)

. regress mpg weight i.rep78

      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(5, 63)        =     25.78
       Model |   1571.8892         5   314.37784   Prob > F        =    0.0000
    Residual |  768.313699        63  12.1954555   R-squared       =    0.6717
-------------+----------------------------------   Adj R-squared   =    0.6456
       Total |   2340.2029        68  34.4147485   Root MSE        =    3.4922

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |   -.005503    .000601    -9.16   0.000     -.006704    -.004302
             |
       rep78 |
          2  |  -.4786043   2.765035    -0.17   0.863    -6.004085    5.046877
          3  |  -.4715623   2.553145    -0.18   0.854    -5.573614     4.63049
          4  |  -.5990319   2.606599    -0.23   0.819    -5.807905    4.609841
          5  |   2.086276   2.724817     0.77   0.447    -3.358836    7.531388
             |
       _cons |   38.05941   3.093361    12.30   0.000     31.87783      44.241
------------------------------------------------------------------------------

. est save m1, replace
file m1.ster saved

. clear all

. est use m1

. describe

Contains data
  obs:             0                          
 vars:             0                          
Sorted by: 

. regress

      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(5, 63)        =     25.78
       Model |   1571.8892         5   314.37784   Prob > F        =    0.0000
    Residual |  768.313699        63  12.1954555   R-squared       =    0.6717
-------------+----------------------------------   Adj R-squared   =    0.6456
       Total |   2340.2029        68  34.4147485   Root MSE        =    3.4922

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |   -.005503    .000601    -9.16   0.000     -.006704    -.004302
             |
       rep78 |
          2  |  -.4786043   2.765035    -0.17   0.863    -6.004085    5.046877
          3  |  -.4715623   2.553145    -0.18   0.854    -5.573614     4.63049
          4  |  -.5990319   2.606599    -0.23   0.819    -5.807905    4.609841
          5  |   2.086276   2.724817     0.77   0.447    -3.358836    7.531388
             |
       _cons |   38.05941   3.093361    12.30   0.000     31.87783      44.241
------------------------------------------------------------------------------

.
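
The possibility raised earlier in this post, that missing values in a categorical variable were replaced with a new category, can be tested directly. The sketch below is only a way to check it and makes no claim about the result. It assumes the same model as above, uses 6 as a hypothetical category of rep78 that was not present when the model was estimated, and mpghat3 is a made-up variable name.

```stata
* Fit the model and save the estimates
sysuse auto, clear
regress mpg weight i.rep78
estimates save m1, replace

* Reload the data and recode missing rep78 to a level the model never saw
sysuse auto, clear
replace rep78 = 6 if missing(rep78)   // 6 is a hypothetical new category

* Predict from the saved estimates
estimates use m1
predict mpghat3, xb

* Inspect what predict produced for the recoded observations
count if missing(mpghat3)
list make rep78 mpghat3 if rep78 == 6

* The coefficient vector shows which levels of rep78 have an estimate
matrix list e(b)
```

Comparing the levels listed in e(b) with the values actually present in the updated data is a quick way to see whether any category lacks a coefficient.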
