
  • Using saved model estimates on an updated data-set

    Hello,

    I have a large dataset of 59,000,000 observations and 100+ variables.

    I have run a regression, the purpose of which is to obtain a predicted cost for each of the 59,000,000 observations (these are people).

    I ran the regression and saved the model estimates to a .ster file using estimates save full_model, replace.

    After running the regression, and obtaining the predictions, I realised that some observations (around 100,000) had missing predictions. This is because some of the independent variables had missing values. I replaced the missing values. I then used the .ster file to re-run the parameter estimates against the full dataset (this was done as a quick fix to get predictions for all observations, prior to running the model at a later date).

    However, when I obtained the predictions, I still had the same number of missing predictions. I understood that I could use saved model estimates against an updated dataset - but despite replacing missing values for certain variables, I am still getting missing predictions for the same observations, even though they no longer have missing values.

    I have read the documentation, and I think it has something to do with setting e(sample) - but I am not sure I quite follow.

    Is anyone able to explain why I still get missing predictions when using saved estimates despite replacing missing values in the data?

  • #2
    After running the regression, and obtaining the predictions, I realised that some observations (around 100,000) had missing predictions. This is because some of the independent variables had missing values. I replaced the missing values. I then used the .ster file to re-run the parameter estimates against the full dataset (this was done as a quick fix to get predictions for all observations, prior to running the model at a later date).

    However, when I obtained the predictions, I still had the same number of missing predictions. I understood that I could use saved model estimates against an updated dataset - but despite replacing missing values for certain variables, I am still getting missing predictions for the same observations, even though they no longer have missing values.
    You need to create a minimal working example (MWE) that replicates this. It may be that you are overlooking missing values in other variables. As long as you have filled in all missing values in every variable the model uses, you should get predictions, as in the example below.

    Code:
    sysuse auto
    regress mpg weight i.rep78
    predict mpghat, xb
    sum mpghat
    est save m1
    clear all
    sysuse auto
    replace rep78=2 if missing(rep78)
    est use m1
    predict mpghat2, xb
    sum mpghat2
    Results:

    Code:
    . sum mpghat
    
        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
          mpghat |         69    21.28985    4.807912   10.95315   30.24022
    
    . 
    . sum mpghat2
    
        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
         mpghat2 |         74    21.33238    4.695795   10.95315   30.24022
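
    If predictions are still missing after the replacement, one way to look for overlooked missing values is to restrict attention to the observations with a missing prediction. This is only a sketch: it assumes m1.ster from the example above is in the working directory, and mpghat3 and nmiss are hypothetical variable names, not anything from the original post.

    ```stata
    * Sketch only: reuses m1.ster from the example above
    sysuse auto, clear
    estimates use m1
    predict double mpghat3, xb          // mpghat3 is a hypothetical name

    * How many predictions are missing?
    count if missing(mpghat3)

    * Which predictors are missing among those observations?
    misstable summarize weight rep78 if missing(mpghat3)

    * Alternatively, count the missing predictors per observation
    egen nmiss = rowmiss(weight rep78)  // nmiss is a hypothetical name
    tabulate nmiss if missing(mpghat3)
    ```

    With 100+ variables, the variable list after misstable summarize and rowmiss() would need to include every variable that appears in the model.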



    • #3
      I then used the .ster file to re-run the parameter estimates against the full dataset
      The meaning of this is unclear to me. As Andrew indicates, you very much need to provide example code, preferably a reproducible example such as his.

      In particular, if you did something like the following modification to Andrew's example
      Code:
      replace rep78=2 if missing(rep78)
      est use m1
      regress
      predict mpghat2, xb
      then you should understand that your second regression was not "re-run" on the full dataset. All that happened was the previous estimates were reprinted unchanged. That does not directly explain why your prediction again had missing values - as Andrew's example demonstrated - but it is possible, for example, that you replaced missing values in a categorical variable with a new category, for which no coefficient has been estimated.
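
      A quick way to check the new-category possibility is to compare the levels that have a coefficient in the saved estimates with the levels now present in the data. This is a sketch under assumptions: m1.ster comes from Andrew's example, and the level 6 and the variable name mpghat_new are hypothetical. I am not asserting what predict returns for the new level; running the code shows how your installation handles it.

      ```stata
      * Sketch only: assumes m1.ster exists in the working directory
      sysuse auto, clear
      replace rep78 = 6 if missing(rep78)   // 6 is a hypothetical new category
      estimates use m1

      * Levels with an estimated coefficient appear in the column names of e(b)
      matrix list e(b)

      * Levels now present in the data
      tabulate rep78, missing

      * Inspect the predictions for the observations with the new level
      predict double mpghat_new, xb         // mpghat_new is a hypothetical name
      list make rep78 weight mpghat_new if rep78 == 6
      ```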

      Here's a demonstration of the second regression not actually needing any data in memory.
      Code:
      . sysuse auto
      (1978 Automobile Data)
      
      . regress mpg weight i.rep78
      
            Source |       SS           df       MS      Number of obs   =        69
      -------------+----------------------------------   F(5, 63)        =     25.78
             Model |   1571.8892         5   314.37784   Prob > F        =    0.0000
          Residual |  768.313699        63  12.1954555   R-squared       =    0.6717
      -------------+----------------------------------   Adj R-squared   =    0.6456
             Total |   2340.2029        68  34.4147485   Root MSE        =    3.4922
      
      ------------------------------------------------------------------------------
               mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
            weight |   -.005503    .000601    -9.16   0.000     -.006704    -.004302
                   |
             rep78 |
                2  |  -.4786043   2.765035    -0.17   0.863    -6.004085    5.046877
                3  |  -.4715623   2.553145    -0.18   0.854    -5.573614     4.63049
                4  |  -.5990319   2.606599    -0.23   0.819    -5.807905    4.609841
                5  |   2.086276   2.724817     0.77   0.447    -3.358836    7.531388
                   |
             _cons |   38.05941   3.093361    12.30   0.000     31.87783      44.241
      ------------------------------------------------------------------------------
      
      . est save m1, replace
      file m1.ster saved
      
      . clear all
      
      . est use m1
      
      . describe
      
      Contains data
        obs:             0                          
       vars:             0                          
      Sorted by: 
      
      . regress
      
            Source |       SS           df       MS      Number of obs   =        69
      -------------+----------------------------------   F(5, 63)        =     25.78
             Model |   1571.8892         5   314.37784   Prob > F        =    0.0000
          Residual |  768.313699        63  12.1954555   R-squared       =    0.6717
      -------------+----------------------------------   Adj R-squared   =    0.6456
             Total |   2340.2029        68  34.4147485   Root MSE        =    3.4922
      
      ------------------------------------------------------------------------------
               mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
            weight |   -.005503    .000601    -9.16   0.000     -.006704    -.004302
                   |
             rep78 |
                2  |  -.4786043   2.765035    -0.17   0.863    -6.004085    5.046877
                3  |  -.4715623   2.553145    -0.18   0.854    -5.573614     4.63049
                4  |  -.5990319   2.606599    -0.23   0.819    -5.807905    4.609841
                5  |   2.086276   2.724817     0.77   0.447    -3.358836    7.531388
                   |
             _cons |   38.05941   3.093361    12.30   0.000     31.87783      44.241
      ------------------------------------------------------------------------------
      
      .
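
      Since e(sample) was mentioned in the question: as far as I understand the documentation for estimates save, estimates use does not restore e(sample), so it is 0 for every observation until it is reset or the model is fitted again. The sketch below checks that and then actually re-estimates on the updated data by issuing the full command. The file name m1_refit is hypothetical.

      ```stata
      * Sketch only: assumes m1.ster exists in the working directory
      sysuse auto, clear
      replace rep78 = 2 if missing(rep78)
      estimates use m1

      * Observations flagged as the estimation sample after loading the saved results
      count if e(sample)

      * Typing the full command fits the model again on the data now in memory
      regress mpg weight i.rep78
      count if e(sample)

      * Save the refitted results under a new (hypothetical) name
      estimates save m1_refit, replace
      ```

      If the prediction step in the original code used a condition such as "if e(sample)", the first count would be worth checking.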
