Mysterious behaviour of -predict- after regress, when a regressor went missing

Joro Kolev

Join Date: Aug 2018

Posts: 3047
#1


28 Oct 2018, 09:16

Good afternoon,

I encountered the following mysterious behaviour of -predict- , which I cannot rationalise by anything that I know regarding how Stata works.

1. I run a regression of price on mpg, and I predict the residual. (This is the standard scenario, the benchmark.)

2. I drop the regressor mpg, and try again to predict.

a) When I call the predicted value an arbitrary name, Stata behaves as expected and tells me that it cannot find the regressor mpg, and hence is unable to calculate the predictions.

b) The mystery occurs when I call the predicted values by the name of the missing regressor mpg. Stata does not report any problem, and calculates something, I am not sure what...

Code:

. sysuse auto, clear
(1978 Automobile Data)

. keep price mpg

. keep in 1/5
(69 observations deleted)

. qui reg price mpg

. predict correctresidual, resid

. list correctresidual

     +-----------+
     | correct~l |
     |-----------|
  1. |  293.3505 |
  2. |  -1292.99 |
  3. | -6.649485 |
  4. |  115.8144 |
  5. |  890.4742 |
     +-----------+

. drop mpg

. predict wrongresidual, resid
variable mpg not found
r(111);

. predict mpg, resid

. list mpg

     +-----------+
     |       mpg |
     |-----------|
  1. |  8.00e+24 |
  2. |  4.25e+24 |
  3. | -9846.547 |
  4. |  1.38e+35 |
  5. |  3.22e+34 |
     +-----------+

.

Does anyone have any guess what just happened here?
Rich Goldstein

Join Date: Mar 2014

Posts: 4426
#2

28 Oct 2018, 14:06

If the above shows what you actually did, the problem is that you dropped "mpg" and then tried to use it (-predict- calls on every variable in the regression), and Stata correctly told you that it could not be found. You then created a new "mpg" with your last -predict- command. If you are having trouble with this, please look at the help for each of the commands you are using.
Nick Cox

Join Date: Mar 2014

Posts: 35356
#3

28 Oct 2018, 15:48

Once you dropped mpg, then mpg is available as a (new) variable name for whatever you want. Using mpg as a new name is emphatically not a way to retrieve the original mpg.
Joro Kolev

Join Date: Aug 2018

Posts: 3047
#4

28 Oct 2018, 16:05

Hi Nick, that mpg becomes available for a new variable is clear.

What is not clear to me is what the right hand side of the expression

predict mpg, resid

which is implicitly

gen mpg = price - _b[_cons] - _b[mpg]*mpg

evaluates to when mpg on the right hand side is missing because I dropped it. Or apparently it is not missing to Stata, since Stata goes ahead and carries out some calculation.

I am not trying to recover mpg. The example above is a toy example illustrating mysterious behaviour which got me puzzled.
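
For reference, the identity behind that implicit expression can be checked while mpg is still in the data. This is only a sketch: the variable names r_predict and r_manual are made up for illustration, and it assumes nothing beyond -predict, resid- returning the dependent variable minus the linear prediction.

```stata
sysuse auto, clear
keep in 1/5
quietly regress price mpg

* Residuals from -predict- and from the formula written out by hand
predict double r_predict, resid
generate double r_manual = price - _b[_cons] - _b[mpg]*mpg

* The two should agree up to floating-point rounding
assert reldif(r_predict, r_manual) < 1e-8
list price mpg r_predict r_manual
```

If the -assert- passes silently, the two columns coincide, which is the benchmark that the post-drop results should be compared against.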

Joro Kolev

Join Date: Aug 2018
Posts: 3047
#5

28 Oct 2018, 16:29

Here is what happens when I do by hand what Stata is supposed to do when I call -predict-. Nothing mysterious happens here, Stata behaves as expected.

The puzzle is why the outcome above, when using -predict-, is not the same as the outcome below when I manually generate the prediction

Code:

. sysuse auto, clear
(1978 Automobile Data)

. keep in 1/5
(69 observations deleted)

. keep price mpg

. reg price mpg

      Source |       SS           df       MS      Number of obs   =         5
-------------+----------------------------------   F(1, 3)         =      9.08
       Model |  7761889.59         1  7761889.59   Prob > F        =    0.0571
    Residual |  2564278.41         3  854759.471   R-squared       =    0.7517
-------------+----------------------------------   Adj R-squared   =    0.6689
       Total |    10326168         4     2581542   Root MSE        =    924.53

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |   -447.268   148.4247    -3.01   0.057    -919.6216    25.08551
       _cons |   13645.55   2879.592     4.74   0.018     4481.401    22809.69
------------------------------------------------------------------------------

. drop mpg

. gen double mpg = price - _b[_cons] - _b[mpg]*mpg
mpg not found
r(111);

.


Nick Cox

Join Date: Mar 2014

Posts: 35356
#6

28 Oct 2018, 16:34

Yes indeed; good question. The results don't make sense as residuals but I agree: why you get results at all is a mystery.
Clyde Schechter

Join Date: Apr 2014

Posts: 29911
#7

28 Oct 2018, 17:05

Well, predict.ado calls a built-in command _predict, so only StataCorp knows for sure what is happening here.

Based on general experience with programming languages, here's my best guess as to what is going on:

After dropping mpg, when you run -predict mpg, resid-, Stata begins by creating a new variable mpg after duly verifying that no such variable already exists, and one of the early steps in that is probably requesting that the operating system allocate some memory for that. That memory is given over to Stata and it contains whatever happened to be left in it by the last program that happened to use that block of memory. This is known technically as "filled with garbage." Stata uses that memory and duly assigns it the variable name mpg. Next it notes that you want the residual. In preparing to calculate the residual, Stata notes that a variable named mpg was, indeed, among the predictors in the regression, and that variable mpg still exists, so it proceeds to calculate price - _b[_cons] - _b[mpg]*mpg. All of these exist (dropping -mpg- does not erase _b[mpg]), and the calculation proceeds. And then the results are stored in variable mpg. The problem, of course, is that the values of mpg that are used to calculate the residuals are the garbage that happened to be in the allocated memory and have nothing to do with the actual data.

Bear in mind that there is no requirement, with -predict-, that the regression variables be the same as they were when the regression ran. It is intended that -predict- can be run with counterfactual data that is replaced into the predictor variables after the regression: that is how -margins- works. What has gone wrong in your situation is that in this case the counterfactual data is garbage.

That said, it is poor design, if not an outright bug, to allow -predict newvar- where newvar is the same as a variable used as a predictor in the regression. In fact, if you hadn't -drop-ped mpg and tried to -predict mpg, resid-, Stata would complain that mpg already exists. At the point where it does that, better design would also be for Stata to verify that the new variable is not, itself, a variable in the regression.
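
A user-level version of such a check might look like the sketch below. The program name safepredict is hypothetical and not part of Stata; it assumes the new variable name is the last word before the comma, that the coefficient names in e(b) are plain variable names (no factor-variable or time-series operators), and it does not handle -if- or -in- qualifiers.

```stata
* Hypothetical wrapper (not part of Stata): refuse to predict into a name
* that appears among the coefficient names of the last estimation.
capture program drop safepredict
program define safepredict
    version 13
    syntax anything(name=spec) [, *]

    * The new variable name is the last word; a storage type may precede it
    local n : word count `spec'
    local newvar : word `n' of `spec'

    * Column names of e(b) hold the regressor names (plus _cons)
    local xvars : colnames e(b)
    local clash : list newvar in xvars
    if `clash' {
        display as error "`newvar' was a regressor in the last estimation"
        exit 198
    }

    * No clash: hand everything over to the official command
    predict `spec', `options'
end
```

With this in place, -safepredict mpg, resid- after -drop mpg- would stop with an error instead of creating the variable, while -safepredict double myres, resid- would pass through to -predict- unchanged.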

Added: I'm curious how you stumbled upon this. The -predict- command has been around for a long time, and as far as I know, nobody has ever noticed this problem before. I can't think of any situation where I would intentionally want to -drop mpg- and then -predict mpg-. So did this just happen by accident: some typo or copy/paste error or something like that? Or were you deliberately testing -predict- and thought of this as a test case? Or is there some actual use case for -predict- where -drop mpg- followed by -predict mpg- might make sense? If the last, what is the use case, and what results were you expecting to get?

Further added: You can test my hypothesis that variable mpg is being populated with garbage by shutting down Stata. Run some other programs that use up a lot of memory. Then re-launch Stata and do the -drop mpg- -predict mpg, resid- routine again. If I am correct, the results will be equally mysterious looking, but different from what you got the first time.

Last edited by Clyde Schechter; 28 Oct 2018, 17:14.
Joro Kolev

Join Date: Aug 2018

Posts: 3047
#8

28 Oct 2018, 17:45

Hi Clyde.

The funny part of the mystery is that I managed to figure out what the built in - _predict - does. I replicated the "bug," when the call is to - _predict -, (but not to -predict-).

I have a long personal history of the command -replace- backstabbing me on occasions when I try to be economical and save on the creation of new variables. So my first guess, before I read your post, was that somewhere behind the scenes the backstabbing -replace- was out to get me again.

Here is a call to _predict, followed by a manual replication involving -replace- that gives the same (buggy) result. And yes, -replace- is at work here, as I expected:

Code:

. sysuse auto, clear
(1978 Automobile Data)

. keep in 1/5
(69 observations deleted)

. keep price mpg

. qui reg price mpg

. drop mpg

. _predict mpg, resid

. list mpg

     +-----------+
     |       mpg |
     |-----------|
  1. | -9546.547 |
  2. | -8896.547 |
  3. | -9846.547 |
  4. | -8829.547 |
  5. | -5818.546 |
     +-----------+

Now I am going to get the same result with -generate- and -replace-:

Code:

. sysuse auto, clear
(1978 Automobile Data)

. keep in 1/5
(69 observations deleted)

. keep price mpg

. qui reg price mpg

. drop mpg

. gen double mpg = 0

. replace mpg = price - _b[_cons] - _b[mpg]*mpg
(5 real changes made)

. list mpg

     +------------+
     |        mpg |
     |------------|
  1. | -9546.5464 |
  2. | -8896.5464 |
  3. | -9846.5464 |
  4. | -8829.5464 |
  5. | -5818.5464 |
     +------------+

.

The (buggy) result is the same.
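
These numbers are what one would expect if the right hand side is evaluated with mpg equal to 0, in which case the slope term drops out and the result is simply price - _b[_cons] (for the first observation, 4099 - 13645.55, roughly -9546.55). A small sketch that checks this; the variable name check is made up for illustration:

```stata
sysuse auto, clear
keep in 1/5
keep price mpg
quietly regress price mpg
drop mpg

* -replace- evaluates the right hand side with the current values of mpg (all 0)
generate double mpg = 0
replace mpg = price - _b[_cons] - _b[mpg]*mpg

* With mpg == 0 the slope term vanishes, leaving price minus the intercept
generate double check = price - _b[_cons]
assert reldif(mpg, check) < 1e-12
```

So the -generate- plus -replace- route does not reproduce a residual; it reproduces the price shifted by the intercept.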
Joro Kolev

Join Date: Aug 2018

Posts: 3047
#9

28 Oct 2018, 18:13

Regarding Clyde's question about how I stumbled upon this issue: I was doing something silly. I thought that I could use the variable *res* in the regression, then drop it, and after dropping it still generate a legitimate prediction as if *res* were still present in my data. Because this is silly, I figured out pretty quickly that there was a problem.

There is no need for explaining that the code below is silly, I have fixed the code now and it does what I want. I am providing the wrong code just to illustrate how I came upon the issue with -predict-.

So I am implementing the iterative Telser system estimator. I am copying a short explanation of the procedure from my previous post
https://www.statalist.org/forums/for...inate-the-loop

For concreteness say I have a system of two equations:
(1) y = x'b + e
(2) w = z'g + v.

I want to implement the following procedure due to Telser, L. G. (1964). Iterative estimation of a set of linear regression equations. Journal of the American Statistical Association, 59(307), 845-862.

Estimate (2) by OLS, get the residuals v(0) = w - z'g(0).
Estimate modification of (1) by OLS: y = x'b(0) + a*v(0) + error. Get the residual e(0) = y - x'b(0).
Estimate modification of (2) by OLS: w = z'g(1) + c*e(0) + error. Get the residual v(1) = w - z'g(1)
Estimate modification of (1) by OLS: y = x'b(1) + a*v(1) + error. Get the residual e(1) = y - x'b(1).
Estimate modification of (2) by OLS: w = z'g(2) + c*e(1) + error. Get the residual v(2) = w - z'g(2)
................. Repeat until b(n) converges

And here is the wrong code which attempted to do the above "economically"

Code:

forvalues i=1/16000 {

reg mpg headroom trunk weight
predict double res, resid

reg price mpg res
drop res // this is the trouble maker
predict double res, resid
replace res = res + _b[res]*res
mat b = e(b)

reg mpg headroom trunk weight res
drop res // this is the trouble maker
predict double res, resid
replace res = res + _b[res]*res

reg price mpg res
drop res
predict double res, resid
replace res = res + _b[res]*res
mat bb = e(b)

dis "The last iteration is " `i'

if mreldif(b, bb)<1e-6 continue, break
}

Joro Kolev

Join Date: Aug 2018
Posts: 3047

#10

28 Oct 2018, 18:27

Or maybe the code was like this. I have already changed it to something that works and deleted the wrong version, so I am reproducing it from memory:

Code:

sysuse auto, clear

reg mpg headroom trunk weight
predict double res, resid

forvalues i=1/16000 {

reg price mpg res
drop res // this is the trouble maker
predict double res, resid
replace res = res + _b[res]*res
mat b = e(b)

reg mpg headroom trunk weight res
drop res // this is the trouble maker
predict double res, resid
replace res = res + _b[res]*res

reg price mpg res
drop res 
predict double res, resid
replace res = res + _b[res]*res
mat bb = e(b)

dis "The last iteration is " `i'

if mreldif(b, bb)<1e-6 continue, break
}


Clyde Schechter

Join Date: Apr 2014

Posts: 29911
#11

28 Oct 2018, 18:43

Thank you for all the follow-up. Very interesting all around!
Richard Williams

Join Date: Apr 2014

Posts: 4900
#12

28 Oct 2018, 19:09

I haven't followed the above carefully. But if it is a bug, it may be the most wildly esoteric bug I have ever seen. Maybe StataCorp will give you some kind of award for this. ;-)

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 18.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam
Joro Kolev

Join Date: Aug 2018

Posts: 3047
#13

29 Oct 2018, 04:32

To summarise the discussion:

1. This seems to be a bug, because Stata does not behave consistently. With some experimentation it appears that Clyde is right, and one cannot rely on Stata consistently setting the missing variable on the right hand side of -predict- to 0. It seems that Stata sets it to whatever happens to be there, and then replaces it. I will write to StataCorp to let them know about this, as Richard suggested. I will request a prize too (to be shared with Clyde, as obviously he has a better idea than I do of what is going on here), but I am not very hopeful regarding a positive outcome of this request.

2. I think the issue is general enough. As you see in my application the variable *res* is simply used consecutively for intermediate calculations. In such an application it makes perfect sense (at least to me) to call the intermediate variable *res* always by the same name, although *res* has different meaning as we go down along the iterations sequence. All I was trying to do in my code was to be economical and not to clutter my working space with too many res1, res2, res3, etc.

Finally, here is the code where I have corrected my silly mistakes. This code should have worked if Stata behaved consistently and initialised the missing variable on the right hand side of predict to 0. It does not work; apparently the replication of - _predict - that I showed a couple of posts above was coincidental and cannot be relied upon.

Code:

sysuse auto, clear

reg mpg headroom trunk weight
_predict double res, resid

forvalues i=1/16000 {

reg price mpg res
drop res // this will not work
_predict double res, resid
mat b = e(b)

reg mpg headroom trunk weight res
drop res // this will not work
_predict double res, resid

reg price mpg res
drop res // this will not work
_predict double res, resid
mat bb = e(b)

reg mpg headroom trunk weight res
drop res // this will not work
_predict double res, resid

dis "The last iteration is " `i'

if mreldif(b, bb)<1e-6 continue, break
}
Joro Kolev

Join Date: Aug 2018

Posts: 3047
#14

29 Oct 2018, 05:01

And finally finally, for complete completeness, here is the code that correctly implements Telser's iterative procedure.

As you can see, the code is not as elegant and economical as the wrong code above, because I need to introduce two variables, res and losresidulos. The correct code also requires a bit longer attention span on my side, as I need to keep track of whether I am dealing with res, or with losresidulos.

It would have been nice if Stata either consistently set the missing variable on the right hand side of - _predict - to 0, or somehow kept its own record of the variables on which the regression was fit, so that even after I drop *res* (so that I can recycle the name), Stata could still calculate the correct prediction. In the second scenario I would have been able to make some version of the first incorrect (but economical) code I posted above work.

Code:

sysuse auto, clear

reg mpg headroom trunk weight
_predict double res, resid

forvalues i=1/16000 {

reg price mpg res
replace res = 0
_predict double losresidulos, resid
drop res
mat b = e(b)

reg mpg headroom trunk weight losresidulos
replace losresidulos = 0
_predict double res, resid
drop losresidulos

reg price mpg res
replace res = 0
_predict double losresidulos, resid
mat bb = e(b)
drop res

reg mpg headroom trunk weight losresidulos
replace losresidulos = 0
_predict double res, resid
drop losresidulos

dis "The last iteration is " `i'

if mreldif(b, bb)<1e-6 continue, break
}
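
A possible alternative that avoids both the zeroing step and predicting into a dropped name is to predict into a fresh name while the old regressor is still in the data, add the extra term back by hand, and only then recycle the name. This is a sketch under assumptions, not a tested estimator: the names v, e_new, v_new, b_old, b_new and iters are made up for illustration, and whether the loop converges on these data has not been checked.

```stata
sysuse auto, clear

* Step 0: OLS on equation (2); v holds v(0) = w - z'g(0)
quietly regress mpg headroom trunk weight
predict double v, resid

local iters = 0
forvalues i = 1/16000 {
    local iters = `i'

    * Equation (1) augmented with the current v
    quietly regress price mpg v
    matrix b_new = e(b)
    predict double e_new, resid                 // y - x'b - a*v, while v still exists
    quietly replace e_new = e_new + _b[v]*v     // add a*v back: e = y - x'b

    * Equation (2) augmented with e
    quietly regress mpg headroom trunk weight e_new
    predict double v_new, resid                 // w - z'g - c*e
    quietly replace v_new = v_new + _b[e_new]*e_new   // add c*e back: v = w - z'g

    * Only now recycle the working name v
    drop v e_new
    rename v_new v

    * Stop when the coefficients of equation (1) settle
    if `i' > 1 {
        if mreldif(b_new, b_old) < 1e-6 {
            continue, break
        }
    }
    matrix b_old = b_new
}
display "The last iteration is `iters'"
```

The point of the design is that every -predict- runs while all regressors of the last fit are present, so nothing depends on how a missing right hand side variable is initialised.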