
  • Variables with perfect residual but regression seems to fail

    I have a huge financial dataset (a panel) from an order book. The dataset contains three variables, and there is a theory about how these three variables should behave. For simplicity, let me call them var1, var2, and var3. The theory says
    Code:
    var1 - var2 + var3 = const
    but the theory is silent about what the constant is. Now, I performed two sets of commands, and the results differ; I do not understand either of them.

    First. Since I have a theory, I generated
    Code:
    gen res = var1 - var2 + var3
    and plotted the resulting variable (in fact I subtracted the mean, but this is not important at the moment):
    [Figure: histogram of res (ref.png)]


    Now, this looks like a very good normal distribution, and it still does if I vary the bin width of the histogram. Nevertheless, a formal Kolmogorov-Smirnov test rejects normality, but that is because I have about 16 million observations.
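
    For reference, here is a minimal sketch of that check in Stata (variable names as in the post; the centering step and the test specification are illustrative, not the exact commands used):
    Code:
    * demean the theory residual and inspect its distribution
    quietly summarize res
    local m = r(mean)
    local s = r(sd)
    generate res_c = res - `m'
    histogram res_c, normal
    ksmirnov res_c = normal(res_c/`s')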

    Second. On the other hand, if I run a regression
    Code:
    regress var1 var2 var3
    I get into trouble. The result is
    Code:
            var1 | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
            var2 |  -.0483928   .0001229  -393.61   0.000    -.0486337   -.0481518
            var3 |   .5579093   .0002678  2083.23   0.000     .5573844    .5584342
           _cons |  -5.389484   .0013784 -3909.99   0.000    -5.392186   -5.386783
    which yields a different relation from the one the theory above predicts. The usual post-regression checks suggest that homoskedasticity does not hold. Instead of a formal test, I plotted the residuals again, this time as
    Code:
    gen res2 = var1+0.0483928*var2-0.55790*var3
    and obtained this picture, which is definitely not normal:
    [Figure: histogram of res2 (ref2.png)]


    In my opinion, my first approach is sufficient because I get a convincing result. My coauthor says that the literature follows the second path and will not accept my first approach. Does anybody understand what I am talking about, and what is wrong here?

  • #2
    A symmetric bell shape is necessary but not sufficient to declare a distribution normal, or nearly so. qnorm is much more discriminating than a histogram.

    A while back, Harold Jeffreys suggested that high-quality measurements in which measurement error was dominant were in practice distributed more like t distributions with about 7 degrees of freedom. Simulating from such a distribution is salutary: histograms don't hint at the non-normality, but normal quantile plots work better.
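
    A minimal simulation sketch of that point (sample size and seed are arbitrary):
    Code:
    * draw from a t distribution with 7 degrees of freedom
    clear
    set obs 10000
    set seed 20220214
    generate t7 = rt(7)
    histogram t7, normal   // looks innocently bell-shaped
    qnorm t7               // the tails give the game away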



    • #3
      Andreas:
      OLS residuals (epsilon) are the difference between the observed and the fitted values:
      Code:
      . use "C:\Program Files\Stata17\ado\base\a\auto.dta"
      (1978 automobile data)
      
      . regress price mpg
      
            Source |       SS           df       MS      Number of obs   =        74
      -------------+----------------------------------   F(1, 72)        =     20.26
             Model |   139449474         1   139449474   Prob > F        =    0.0000
          Residual |   495615923        72  6883554.48   R-squared       =    0.2196
      -------------+----------------------------------   Adj R-squared   =    0.2087
             Total |   635065396        73  8699525.97   Root MSE        =    2623.7
      
      ------------------------------------------------------------------------------
             price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
      -------------+----------------------------------------------------------------
               mpg |  -238.8943   53.07669    -4.50   0.000    -344.7008   -133.0879
             _cons |   11253.06   1170.813     9.61   0.000     8919.088    13587.03
      ------------------------------------------------------------------------------
      
      . predict fitted, xb
      
      . predict residual, res
      
      . list price fitted residual in 1
      
           +------------------------------+
           | price     fitted    residual |
           |------------------------------|
        1. | 4,099   5997.385   -1898.385 |
           +------------------------------+
      
      . di 4099-5997.385
      -1898.385
      
      .
      If you switch from -regress- to -xtreg- (if you actually have panel data), things get more complicated, as you have u_i (the panel-wise error term) in addition to epsilon.
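      A minimal sketch of that case, using a standard example dataset rather than the original data (the dataset and variables here are illustrative):
      Code:
      * both error components after -xtreg, fe-
      webuse nlswork, clear
      xtset idcode year
      xtreg ln_wage ttl_exp, fe
      predict u_i, u     // panel-level component u_i
      predict e_it, e    // idiosyncratic component e_it
      predict both, ue   // combined residual u_i + e_it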
      Kind regards,
      Carlo
      (StataNow 18.5)



      • #4
        Thanks for all the replies, and sorry for being a bit sloppy. I did check normality in the first histogram and simply forgot to overlay the bell curve; here it is:
        [Figure: histogram of res with overlaid normal density (Bildschirmfoto 2022-02-14 um 08.54.37.png)]


        What puzzled me was the following. I have a relation whose residual seems to be perfectly normal, namely
        Code:
        var1-var2+var3=const
        so I expected exactly those coefficients when I ran the regression. Instead, different coefficients turned up, with a non-normal residual. So something must be "wrong", and I do not know what. Maybe the residuals are autocorrelated in the first place? Why do I not get the coefficients -1 and +1 even though I have such a nice residual term?



        • #5
          Andreas:
          residual normality does not imply that the residual distribution has no standard deviation (as you can see from your bell-shaped graph).
          Hence, I fail to follow your belief about the _cons = res equality.
          Kind regards,
          Carlo
          (StataNow 18.5)



          • #6
            Oh no, again I was sloppy (thank you, Carlo!). I do expect the residuals to have a standard deviation; that is fine. What I did not expect was the following. If I "know" (from my first observation) that
            Code:
            var1 - var2 + var3 ~ Normal(mu = const, sigma = some standard deviation)
            then it should follow that, when I regress
            Code:
            regress var1 var2 var3
            the coefficients are +1 and -1. And this did not happen.
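
            One way to see this expectation is a simulation in which the theory holds by construction and the noise is independent of var2 and var3; OLS then does recover +1 and -1 (a sketch with made-up data, not the actual dataset):
            Code:
            * theory holds by construction: var1 = 5 + var2 - var3 + noise
            clear
            set obs 100000
            set seed 42
            generate var2 = rnormal()
            generate var3 = rnormal()
            generate var1 = 5 + var2 - var3 + rnormal()
            regress var1 var2 var3   // coefficients close to +1 and -1, _cons close to 5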



            • #7
              Andreas:
              are you sure that you're not mixing up predictors (that is, your independent variables) with regression coefficients?
              Kind regards,
              Carlo
              (StataNow 18.5)



              • #8
                In some sense, "yes". I expected the theory's coefficients on my independent variables to show up as the regression coefficients, and I am wondering why this is not the case. As I write this, I will check whether my residuals coming from
                Code:
                var1-var2+var3-const
                are indeed independent of var2 and var3. If they are not, then this might explain what I am observing; a quick check is sketched below.
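
                A sketch of that check (const here would be the sample mean of the theory residual; names as above):
                Code:
                * does the theory residual covary with the regressors?
                generate res_theory = var1 - var2 + var3
                correlate res_theory var2 var3   // nonzero correlations would explain the shifted coefficients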
