
  • Variables with perfect residual but regression seems to fail

    I have a huge financial dataset (a panel) from an order book. The dataset contains three variables, and there is a theory about how these three variables should behave. For simplicity, let me call them var1, var2, and var3. The theory says
    Code:
    var1 - var2 + var3 = const
    but the theory is silent about what the constant is. Now, I performed two sets of commands, and the results differ; I do not understand either of them.

    First. Since I have a theory, I generated
    Code:
    gen res = var1 - var2 + var3
    and plotted the resulting variable (in fact I subtracted the mean, but this is not important at the moment):
    [Figure: histogram of res (ref.png)]


    Now, this looks like a very good normal distribution, and it still does if I vary the bin width of the histogram. Nevertheless, a formal Kolmogorov-Smirnov test rejects normality, but that is because I have about 16 million observations.
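
    For reference, here is a minimal sketch of that check in Stata (variable names as in the post; the centering step and the test specification are illustrative, not the exact commands used):
    Code:
    * demean the theory residual and inspect its distribution
    quietly summarize res
    local m = r(mean)
    local s = r(sd)
    generate res_c = res - `m'
    histogram res_c, normal
    ksmirnov res_c = normal(res_c/`s')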

    Second. On the other hand, if I run a regression
    Code:
    regress var1 var2 var3
    I get into trouble. The result is
    Code:
            var1 | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
            var2 |  -.0483928   .0001229  -393.61   0.000    -.0486337   -.0481518
            var3 |   .5579093   .0002678  2083.23   0.000     .5573844    .5584342
           _cons |  -5.389484   .0013784 -3909.99   0.000    -5.392186   -5.386783
    which yields a different relation from the one the theory above predicts. The usual post-regression checks suggest that homoskedasticity does not hold. Instead of a formal test, I plotted the residuals again, this time as
    Code:
    gen res2 = var1+0.0483928*var2-0.55790*var3
    and obtained this picture, which is definitely not normal:
    [Figure: histogram of res2 (ref2.png)]


    In my opinion, my first approach is sufficient because I get a convincing result. My coauthor says that the literature follows the second path and will not accept my first approach. Does anybody understand what I am talking about, and what is wrong here?

  • #2
    A symmetric bell shape is necessary but not sufficient to declare a distribution normal, or nearly so. qnorm is much more discriminating than a histogram.

    A while back, Harold Jeffreys suggested that high-quality measurements in which measurement error was dominant were in practice distributed more like t distributions with about 7 degrees of freedom. Simulating from such a distribution is salutary: histograms don't hint at the non-normality, but normal quantile plots work better.
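
    A minimal simulation sketch of that point (sample size and seed are arbitrary):
    Code:
    * draw from a t distribution with 7 degrees of freedom
    clear
    set obs 10000
    set seed 20220214
    generate t7 = rt(7)
    histogram t7, normal   // looks innocently bell-shaped
    qnorm t7               // the tails give the game away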



    • #3
      Andreas:
      OLS residuals (epsilon) are the difference between the observed and the fitted values:
      Code:
      . use "C:\Program Files\Stata17\ado\base\a\auto.dta"
      (1978 automobile data)
      
      . regress price mpg
      
            Source |       SS           df       MS      Number of obs   =        74
      -------------+----------------------------------   F(1, 72)        =     20.26
             Model |   139449474         1   139449474   Prob > F        =    0.0000
          Residual |   495615923        72  6883554.48   R-squared       =    0.2196
      -------------+----------------------------------   Adj R-squared   =    0.2087
             Total |   635065396        73  8699525.97   Root MSE        =    2623.7
      
      ------------------------------------------------------------------------------
             price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
      -------------+----------------------------------------------------------------
               mpg |  -238.8943   53.07669    -4.50   0.000    -344.7008   -133.0879
             _cons |   11253.06   1170.813     9.61   0.000     8919.088    13587.03
      ------------------------------------------------------------------------------
      
      . predict fitted, xb
      
      . predict residual, res
      
      . list price fitted residual in 1
      
           +------------------------------+
           | price     fitted    residual |
           |------------------------------|
        1. | 4,099   5997.385   -1898.385 |
           +------------------------------+
      
      . di 4099-5997.385
      -1898.385
      
      .
      If you switch from -regress- to -xtreg- (if you actually have panel data), things get more complicated, as you have u_i (the panel-wise error term) in addition to epsilon.
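      A minimal sketch of that case, using a standard example dataset rather than the original data (the dataset and variables here are illustrative):
      Code:
      * both error components after -xtreg, fe-
      webuse nlswork, clear
      xtset idcode year
      xtreg ln_wage ttl_exp, fe
      predict u_i, u     // panel-level component u_i
      predict e_it, e    // idiosyncratic component e_it
      predict both, ue   // combined residual u_i + e_it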
      Kind regards,
      Carlo
      (StataNow 18.5)



      • #4
        Thanks for all the replies, and sorry for being a bit sloppy. I did check normality in the first histogram and simply forgot to overlay the bell curve; here it is:
        [Figure: histogram of res with overlaid normal density (Bildschirmfoto 2022-02-14 um 08.54.37.png)]


        What puzzled me was the following. I have a relation whose residual seems to be perfectly normal, namely
        Code:
        var1-var2+var3=const
        so I expected exactly those coefficients when I ran the regression. Instead, different coefficients turned up, with a non-normal residual. So something must be "wrong", and I do not know what. Maybe the residuals are autocorrelated in the first place? Why do I not get the coefficients -1 and +1 even though I have such a nice residual term?



        • #5
          Andreas:
          residual normality does not imply that the residual distribution has no standard deviation (as you can see from your bell-shaped graph).
          Hence, I fail to follow your belief about the _cons = res equality.
          Kind regards,
          Carlo
          (StataNow 18.5)



          • #6
            Oh no, again I was sloppy (thank you, Carlo!). I do expect the residuals to have a standard deviation; that is fine. What I did not expect was the following. If I "know" (from my first observation) that
            Code:
            var1 - var2 + var3 ~ Normal(mu = const, sigma = some standard deviation)
            then it should follow that, when I regress
            Code:
            regress var1 var2 var3
            the coefficients are +1 and -1. And this did not happen.
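
            One way to see this expectation is a simulation in which the theory holds by construction and the noise is independent of var2 and var3; OLS then does recover +1 and -1 (a sketch with made-up data, not the actual dataset):
            Code:
            * theory holds by construction: var1 = 5 + var2 - var3 + noise
            clear
            set obs 100000
            set seed 42
            generate var2 = rnormal()
            generate var3 = rnormal()
            generate var1 = 5 + var2 - var3 + rnormal()
            regress var1 var2 var3   // coefficients close to +1 and -1, _cons close to 5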



            • #7
              Andreas:
              are you sure that you're not mixing up predictors (that is, your independent variables) with regression coefficients?
              Kind regards,
              Carlo
              (StataNow 18.5)



              • #8
                In some sense, "yes". I expected the theory's coefficients on my independent variables to show up as the regression coefficients, and I am wondering why this is not the case. As I write this, I will check whether my residuals coming from
                Code:
                var1-var2+var3-const
                are indeed independent of var2 and var3. If they are not, then this might explain what I am observing; a quick check is sketched below.
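
                A sketch of that check (const here would be the sample mean of the theory residual; names as above):
                Code:
                * does the theory residual covary with the regressors?
                generate res_theory = var1 - var2 + var3
                correlate res_theory var2 var3   // nonzero correlations would explain the shifted coefficients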
