  • Mysterious behaviour of -predict- after regress, when a regressor went missing

    Good afternoon,

    I encountered the following mysterious behaviour of -predict-, which I cannot rationalise from anything I know about how Stata works.

    1. I run a regression of price on mpg, and I predict the residual. (Standard scenario, the benchmark. )

    2. I drop the regressor mpg, and try again to predict.

    a) When I call the predicted value an arbitrary name, Stata behaves as expected and tells me that it cannot find the regressor mpg, and hence is unable to calculate the predictions.

    b) The mystery occurs when I call the predicted values by the name of the missing regressor mpg. Stata does not report any problem, and calculates something, I am not sure what...

    Code:
    . sysuse auto, clear
    (1978 Automobile Data)
    
    . keep price mpg
    
    . keep in 1/5
    (69 observations deleted)
    
    . qui reg price mpg
    
    . predict correctresidual, resid
    
    . list correctresidual
    
         +-----------+
         | correct~l |
         |-----------|
      1. |  293.3505 |
      2. |  -1292.99 |
      3. | -6.649485 |
      4. |  115.8144 |
      5. |  890.4742 |
         +-----------+
    
    . drop mpg
    
    . predict wrongresidual, resid
    variable mpg not found
    r(111);
    
    . predict mpg, resid
    
    . list mpg
    
         +-----------+
         |       mpg |
         |-----------|
      1. |  8.00e+24 |
      2. |  4.25e+24 |
      3. | -9846.547 |
      4. |  1.38e+35 |
      5. |  3.22e+34 |
         +-----------+
    
    .
    Does anyone have any guess what just happened here?

  • #2
    If the above shows what you actually did, the problem is that you dropped "mpg" and then tried to use it (-predict- calls on any variable in the regression), and Stata correctly told you that it could not be found. You then created a new "mpg" with your last -predict- command. If you are having trouble with this, please look at the help for each of the commands you are using.

    • #3
      Once you dropped mpg, then mpg is available as a (new) variable name for whatever you want. Using mpg as a new name is emphatically not a way to retrieve the original mpg.

      • #4
        Hi Nick, that mpg becomes available for a new variable is clear.

        What is not clear to me is to what the right hand side of the expression

        predict mpg, resid

        which is implicitly

        gen mpg = price - _b[_cons] - _b[mpg]*mpg

        evaluates, given that mpg on the right hand side is missing because I dropped it. Or apparently it is not missing to Stata, as Stata goes ahead and carries out some calculation.

        I am not trying to recover mpg. The example above is a toy example illustrating mysterious behaviour which got me puzzled.
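
The equivalence assumed in this post can be checked while mpg is still in the data. The following is a minimal sketch, not a statement about how -predict- is implemented internally; the variable names r_predict and r_manual are made up for illustration.

```stata
sysuse auto, clear
keep in 1/5
quietly regress price mpg

* Residuals as returned by -predict-
predict double r_predict, residuals

* The same residuals computed by hand from the stored coefficients
generate double r_manual = price - _b[_cons] - _b[mpg]*mpg

* The two columns should agree up to rounding error
list r_predict r_manual
```

With mpg present the two columns should match, so the puzzle in this thread is confined to the case where the regressor has been dropped.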



        • #5
          Here is what happens when I do by hand what Stata is supposed to do when I call -predict-. Nothing mysterious happens here, Stata behaves as expected.

          The puzzle is why the outcome above, when using -predict-, is not the same as the outcome below when I manually generate the prediction

          Code:
          . sysuse auto, clear
          (1978 Automobile Data)
          
          . keep in 1/5
          (69 observations deleted)
          
          . keep price mpg
          
          . reg price mpg
          
                Source |       SS           df       MS      Number of obs   =         5
          -------------+----------------------------------   F(1, 3)         =      9.08
                 Model |  7761889.59         1  7761889.59   Prob > F        =    0.0571
              Residual |  2564278.41         3  854759.471   R-squared       =    0.7517
          -------------+----------------------------------   Adj R-squared   =    0.6689
                 Total |    10326168         4     2581542   Root MSE        =    924.53
          
          ------------------------------------------------------------------------------
                 price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
          -------------+----------------------------------------------------------------
                   mpg |   -447.268   148.4247    -3.01   0.057    -919.6216    25.08551
                 _cons |   13645.55   2879.592     4.74   0.018     4481.401    22809.69
          ------------------------------------------------------------------------------
          
          . drop mpg
          
          . gen double mpg = price - _b[_cons] - _b[mpg]*mpg
          mpg not found
          r(111);
          
          .

          • #6
            Yes indeed; good question. The results don't make sense as residuals but I agree: why you get results at all is a mystery.

            • #7
              Well, predict.ado calls a built-in command _predict, so only StataCorp knows for sure what is happening here.

              Based on general experience with programming languages, here's my best guess as to what is going on:

              After dropping mpg, when you run -predict mpg, resid-, Stata begins by creating a new variable mpg after duly verifying that no such variable already exists, and one of the early steps in that is probably requesting that the operating system allocate some memory for that. That memory is given over to Stata and it contains whatever happened to be left in it by the last program that happened to use that block of memory. This is known technically as "filled with garbage." Stata uses that memory and duly assigns it the variable name mpg. Next it notes that you want the residual. In preparing to calculate the residual, Stata notes that a variable named mpg was, indeed, among the predictors in the regression, and that a variable named mpg now exists again, so it proceeds to calculate price - _b[_cons] - _b[mpg]*mpg. All of these exist (dropping -mpg- does not erase _b[mpg]), and the calculation proceeds. And then the results are stored in variable mpg. The problem, of course, is that the values of mpg that are used to calculate the residuals are the garbage that happened to be in the allocated memory and have nothing to do with the actual data.

              Bear in mind that there is no requirement, with -predict-, that the regression variables be the same as they were when the regression ran. It is intended that -predict- can be run with counterfactual data that is replaced into the predictor variables after the regression: that is how -margins- works. What has gone wrong in your situation is that in this case the counterfactual data is garbage.
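
A minimal sketch of that counterfactual use of -predict-, assuming a made-up counterfactual value of 30 for mpg and a hypothetical variable name price_at_30:

```stata
sysuse auto, clear
quietly regress price mpg

* Overwrite the predictor with a counterfactual value after estimation
replace mpg = 30

* -predict- uses the current contents of mpg, not the estimation data
predict double price_at_30, xb

* Every observation now carries the same fitted value, _b[_cons] + _b[mpg]*30
summarize price_at_30
display _b[_cons] + _b[mpg]*30
```

The fitted values follow whatever is stored in mpg when -predict- runs, which is why the contents of a re-created mpg matter so much in the case discussed here.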

              That said, it is poor design, if not an outright bug, to allow -predict newvar- where newvar is the same as a variable used as a predictor in the regression. In fact, if you hadn't -drop-ped mpg and tried to -predict mpg, resid-, Stata would complain that mpg already exists. At the point where it does that, better design would also be for Stata to verify that the new variable is not, itself, a variable in the regression.
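
Until such a check exists, a user-side guard is possible. The sketch below is only one way to do it and assumes plain variable names in the model (no factor-variable or time-series operators); the local macro names coefnames, newname and clash are made up for illustration.

```stata
sysuse auto, clear
keep in 1/5
quietly regress price mpg
drop mpg

* Coefficient names of the fitted model, here "mpg _cons"
local coefnames : colnames e(b)

* Hypothetical target name for the new variable
local newname mpg

* 1 if the target name is among the coefficient names, 0 otherwise
local clash : list newname in coefnames

if `clash' {
    display as error "`newname' is a regressor in the fitted model; pick another name"
}
else {
    predict double `newname', residuals
}
```

In this example the guard fires and no variable mpg is created.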

              Added: I'm curious how you stumbled upon this. The -predict- command has been around for a long time, and as far as I know, nobody has ever noticed this problem before. I can't think of any situation where I would intentionally want to -drop mpg- and then -predict mpg-. So did this just happen by accident: some typo or copy/paste error or something like that? Or were you deliberately testing -predict- and thought of this as a test case? Or is there some actual use case for -predict- where -drop mpg- followed by -predict mpg- might make sense? If the last, what is the use case, and what results were you expecting to get?

              Further added: You can test my hypothesis that variable mpg is being populated with garbage by shutting down Stata. Run some other programs that use up a lot of memory. Then re-launch Stata and do the -drop mpg- -predict mpg, resid- routine again. If I am correct, the results will be equally mysterious looking, but different from what you got the first time.
              Last edited by Clyde Schechter; 28 Oct 2018, 18:14.

              • #8
                Hi Clyde.

                The funny part of the mystery is that I managed to figure out what the built in - _predict - does. I replicated the "bug," when the call is to - _predict -, (but not to -predict-).

                I have a long personal history of the command -replace- regularly backstabbing me, on occasions where I try to be economical and save on the creation of new variables. So my first guess, before I read your post, was that somewhere behind the scenes the backstabbing -replace- was out to get me again.

                Here is a call to _predict, followed by a manual replication involving -replace- that gives the same (buggy) result. And yes, -replace- is in action here, as I expected:

                Code:
                . sysuse auto, clear
                (1978 Automobile Data)
                
                . keep in 1/5
                (69 observations deleted)
                
                . keep price mpg
                
                . qui reg price mpg
                
                . drop mpg
                
                . _predict mpg, resid
                
                . list mpg
                
                     +-----------+
                     |       mpg |
                     |-----------|
                  1. | -9546.547 |
                  2. | -8896.547 |
                  3. | -9846.547 |
                  4. | -8829.547 |
                  5. | -5818.546 |
                     +-----------+
                Now I am going to get the same by -generate- and -replace-

                Code:
                . sysuse auto, clear
                (1978 Automobile Data)
                
                . keep in 1/5
                (69 observations deleted)
                
                . keep price mpg
                
                . qui reg price mpg
                
                . drop mpg
                
                . gen double mpg = 0
                
                . replace mpg = price - _b[_cons] - _b[mpg]*mpg
                (5 real changes made)
                
                . list mpg
                
                     +------------+
                     |        mpg |
                     |------------|
                  1. | -9546.5464 |
                  2. | -8896.5464 |
                  3. | -9846.5464 |
                  4. | -8829.5464 |
                  5. | -5818.5464 |
                     +------------+
                
                .
                The (buggy) result is the same.
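
The listed values are what one gets when the mpg term contributes nothing, that is, price - _b[_cons]. For the first observation, price is 4099 and the constant reported in #5 is 13645.55, giving roughly -9546.55. A minimal sketch of that arithmetic follows; the variable name zero_mpg_resid is made up for illustration.

```stata
sysuse auto, clear
keep in 1/5
keep price mpg
quietly regress price mpg
drop mpg

* "Residual" when the dropped regressor is treated as 0
generate double zero_mpg_resid = price - _b[_cons]

list price zero_mpg_resid
```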




                • #9
                  Regarding Clyde's question about how I stumbled upon this issue: I was doing something silly. I thought that I could use the variable *res* in the regression, then drop it, and after dropping it still generate a legitimate prediction as if *res* were still present in my data. Because this is silly, I figured out pretty quickly that there was a problem.

                  There is no need for explaining that the code below is silly, I have fixed the code now and it does what I want. I am providing the wrong code just to illustrate how I came upon the issue with -predict-.

                  So I am implementing the iterative Telser system estimator. I am copying a short explanation of the procedure from my previous post
                  https://www.statalist.org/forums/for...inate-the-loop

                  For concreteness say I have a system of two equations:
                  (1) y = x'b + e
                  (2) w = z'g + v.

                  I want to implement the following procedure due to Telser, L. G. (1964). Iterative estimation of a set of linear regression equations. Journal of the American Statistical Association, 59(307), 845-862.

                  Estimate (2) by OLS, get the residuals v(0) = w - z'g(0).
                  Estimate modification of (1) by OLS: y = x'b(0) + a*v(0) + error. Get the residual e(0) = y - x'b(0).
                  Estimate modification of (2) by OLS: w = z'g(1) + c*e(0) + error. Get the residual v(1) = w - z'g(1)
                  Estimate modification of (1) by OLS: y = x'b(1) + a*v(1) + error. Get the residual e(1) = y - x'b(1).
                  Estimate modification of (2) by OLS: w = z'g(2) + c*e(1) + error. Get the residual v(2) = w - z'g(2)
                  ................. Repeat until b(n) converges
                  And here is the wrong code which attempted to do the above "economically"
                  Code:
                  forvalues i=1/16000 {
                  
                  reg mpg headroom trunk weight
                  predict double res, resid
                  
                  reg price mpg res
                  drop res // this is the trouble maker
                  predict double res, resid
                  replace res = res + _b[res]*res
                  mat b = e(b)
                  
                  reg mpg headroom trunk weight res
                  drop res // this is the trouble maker
                  predict double res, resid
                  replace res = res + _b[res]*res
                  
                  reg price mpg res
                  drop res 
                  predict double res, resid
                  replace res = res + _b[res]*res
                  mat bb = e(b)
                  
                  dis "The last iteration is " `i'
                  
                  if mreldif(b, bb)<1e-6 continue, break
                  }

                  • #10
                    Or maybe the code was like this. I already changed it to something that works and deleted the wrong version, so I am repeating it from memory:

                    Code:
                    sysuse auto, clear
                    
                    reg mpg headroom trunk weight
                    predict double res, resid
                    
                    forvalues i=1/16000 {
                    
                    reg price mpg res
                    drop res // this is the trouble maker
                    predict double res, resid
                    replace res = res + _b[res]*res
                    mat b = e(b)
                    
                    reg mpg headroom trunk weight res
                    drop res // this is the trouble maker
                    predict double res, resid
                    replace res = res + _b[res]*res
                    
                    reg price mpg res
                    drop res 
                    predict double res, resid
                    replace res = res + _b[res]*res
                    mat bb = e(b)
                    
                    dis "The last iteration is " `i'
                    
                    if mreldif(b, bb)<1e-6 continue, break
                    }

                    • #11
                      Thank you for all the follow-up. Very interesting all around!

                      • #12
                        I haven't followed the above carefully. But if it is a bug, it may be the most wildly esoteric bug I have ever seen. Maybe StataCorp will give you some kind of award for this. ;-)
                        -------------------------------------------
                        Richard Williams, Notre Dame Dept of Sociology
                        Stata Version: 17.0 MP (2 processor)

                        WWW: https://www3.nd.edu/~rwilliam

                        • #13
                          To summarise the discussion:

                          1. This seems to be a bug, because Stata does not behave consistently. With some experimentation it appears that Clyde is right, and one cannot rely on Stata consistently setting the missing variable on the right hand side of -predict- to 0. It seems that Stata sets it to whatever happens to be there, and then replaces it. I will write to StataCorp to let them know about this, as Richard suggested. I will request a prize too (to be shared with Clyde, as he obviously has a better idea than I do of what is going on here), but I am not very hopeful regarding a positive outcome of this request.

                          2. I think the issue is general enough. As you see in my application the variable *res* is simply used consecutively for intermediate calculations. In such an application it makes perfect sense (at least to me) to call the intermediate variable *res* always by the same name, although *res* has different meaning as we go down along the iterations sequence. All I was trying to do in my code was to be economical and not to clutter my working space with too many res1, res2, res3, etc.

                          Finally, here is the code in which I have corrected my silly mistakes. This code should have worked if Stata behaved consistently and initialised the missing variable on the right hand side of -predict- to 0. It does not work; apparently the replication of - _predict - that I showed a couple of posts above was coincidental and cannot be relied upon.

                          Code:
                          sysuse auto, clear
                          
                          reg mpg headroom trunk weight
                          _predict double res, resid
                          
                          forvalues i=1/16000 {
                          
                          reg price mpg res
                          drop res // this will not work
                          _predict double res, resid
                          mat b = e(b)
                          
                          reg mpg headroom trunk weight res
                          drop res // this will not work
                          _predict double res, resid
                          
                          reg price mpg res
                          drop res // this will not work
                          _predict double res, resid
                          mat bb = e(b)
                          
                          reg mpg headroom trunk weight res
                          drop res // this will not work
                          _predict double res, resid
                          
                          dis "The last iteration is " `i'
                          
                          if mreldif(b, bb)<1e-6 continue, break
                          }




                          • #14
                            And finally finally, for complete completeness, here is the code that correctly implements Telser's iterative procedure.

                            As you can see, the code is not as elegant and economical as the wrong code above, because I need to introduce two variables, res and losresidulos. The correct code also requires a somewhat longer attention span on my part, as I need to keep track of whether I am dealing with res or with losresidulos.

                            It would have been nice if either Stata consistently set the missing variable on the right hand side of - _predict - to 0,
                            or Stata somehow kept its own record of the variables on which the regression was fit, so that even after I drop *res* (so that I can recycle the name), Stata could still calculate the correct prediction. In the second scenario I would have been able to make some version of the first incorrect (but economical) code I posted above work.

                            Code:
                            sysuse auto, clear
                            
                            reg mpg headroom trunk weight
                            _predict double res, resid
                            
                            forvalues i=1/16000 {
                            
                            reg price mpg res
                            replace res = 0 
                            _predict double losresidulos, resid
                            drop res
                            mat b = e(b)
                            
                            reg mpg headroom trunk weight losresidulos
                            replace losresidulos = 0
                            _predict double res, resid
                            drop losresidulos
                            
                            reg price mpg res
                            replace res = 0
                            _predict double losresidulos, resid
                            mat bb = e(b)
                            drop res
                            
                            reg mpg headroom trunk weight losresidulos
                            replace losresidulos = 0
                            _predict double res, resid
                            drop losresidulos
                            
                            dis "The last iteration is " `i'
                            
                            if mreldif(b, bb)<1e-6 continue, break
                            }
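
An alternative that avoids zeroing and dropping variables is to compute each residual by hand from the stored coefficients, leaving out the augmenting residual term. This is only a sketch under assumptions: the names res_y, res_w, bnew, bold and converged are made up for illustration, and convergence is checked between consecutive iterations rather than twice within one pass of the loop as in the code above.

```stata
sysuse auto, clear

* Step 0: v(0) = w - z'g(0) from equation (2)
quietly regress mpg headroom trunk weight
predict double res_w, residuals
generate double res_y = .

local converged 0
forvalues i = 1/16000 {
    * Equation (1) augmented with res_w; e = y - x'b leaves out the a*v term
    quietly regress price mpg res_w
    matrix bnew = e(b)
    quietly replace res_y = price - _b[_cons] - _b[mpg]*mpg

    * Equation (2) augmented with res_y; v = w - z'g leaves out the c*e term
    quietly regress mpg headroom trunk weight res_y
    quietly replace res_w = mpg - _b[_cons] - _b[headroom]*headroom ///
        - _b[trunk]*trunk - _b[weight]*weight

    * Stop once the equation (1) coefficients stop changing
    if `i' > 1 {
        if mreldif(bnew, bold) < 1e-6 {
            display "Converged at iteration `i'"
            local converged 1
            continue, break
        }
    }
    matrix bold = bnew
}
```

Because the residuals are written with -replace- from explicit formulas, no variable is dropped and re-created, so the behaviour of -predict- discussed in this thread never comes into play.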
