  • Normality of residual term

    Hello! I ran the skewness/kurtosis test as well as the Shapiro-Wilk normality test, and both rejected the null hypothesis that my residuals are normally distributed, as shown below. I am a bit unsure how I should take this into account in my regression analysis. Thank you in advance!

    Code:
      sktest Residuals
    
                        Skewness/Kurtosis tests for Normality
                                                             ------- joint ------
        Variable |    Obs   Pr(Skewness)   Pr(Kurtosis)  adj chi2(2)    Prob>chi2
    -------------+---------------------------------------------------------------
       Residuals |   2.5e+03   0.0000         0.0000            .              .
    
    . swilk Residuals
    
                       Shapiro-Wilk W test for normal data
    
        Variable |    Obs       W           V         z       Prob>z
    -------------+--------------------------------------------------
       Residuals |   2477    0.61306    557.318    16.209    0.00000

  • #2
    At the risk of being glib, I would just ignore them. (Actually, I wouldn't have done them in the first place.) The basic theory of inference from linear regression is based on the assumption that the residuals are normally distributed. But in fact there is a vast literature establishing that the inferences are pretty robust to violations of that assumption in a wide variety of circumstances. In particular, the tests you ran are very sensitive: they pick up departures from normality that are too small to matter in terms of invalidating inferences from regression. Why don't you run -qnorm Residuals- and see whether the graph suggests a substantial departure from normality? That's a far less sensitive test of normality, but it works much better as an indicator of whether you need to worry.
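
    For concreteness, a minimal sketch of that graphical check (assuming, as in your output above, that the residuals were already saved in a variable called Residuals):

    Code:
      * assumes the regression has been run and the residuals saved
      * beforehand with: predict Residuals, residuals
      qnorm Residuals                // quantile-normal plot of the residuals
      histogram Residuals, normal    // second look: histogram with normal overlay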

    • #3
      Thanks a lot! What would be a good rule of thumb for deciding that you do not need to worry about your residuals? To me the deviations in the graph below do not seem that drastic, but I am not sure whether that is really the case. Thank you in advance!

      [Attachment: qnorm plot of Residuals]

      • #4
        Well, my reaction to that graph is that it's a pretty substantial departure from normality. The residuals don't reach down into the lower range of values nearly as much as a normal distribution would, for one thing. And the distribution looks pretty asymmetric. Now, you do have a decent sample size, and for some models inference will be good even in the face of severe non-normality. And inference may not even be important for your purposes. So I think you need to describe your model in some detail and also tell us what your underlying research questions are (i.e., what you are trying to learn from your model) to get more specific advice on how to proceed from here.

        • #5
          Thanks! Well my regression is as follows:

          Code:
            regress CR_POM Tech_Industry Int_CR_POM_Tech Size Profitability

          Where the independent variables are:

          1) CR_POM (dummy variable) measuring if a firm is close to a credit rating change or not
          2) Tech_Industry (dummy variable) measuring if the firm is in Tech industry or not
          3) Int_CR_POM_Tech is the interaction CR_POM*Tech_Industry
          4) Size and Profitability measure firm size and profitability and are control variables


          Firstly, I am testing whether firms that are close to a credit rating change decrease their investment spending relative to firms that are not close to a change. Secondly, I test whether this effect is more prominent in the tech industry; namely, whether firms in high-tech industries decrease their investment spending more when close to a credit rating change than firms outside the tech industry. Thank you in advance!

          • #6
            A few thoughts:

            1. The command you show doesn't seem to include a dependent variable: you list all of the variables in it as independent variables. Stata will interpret this command as using CR_POM as the dependent variable. I'm going to assume, though, that you just typed the command incorrectly in your post and that a dependent variable is actually in use.

            2. This looks to me like the kind of problem that is ordinarily approached using panel data. That is, you have many firms in your data, and the firms typically have data for multiple time periods. If I am right about that, then your model probably suffers from omitted variable bias, because there will be other attributes of the firms that affect their investment spending and may well also be related to CR_POM and Tech_Industry. It also follows that the error terms within firms are not independent. So I would be inclined to model this with -xtreg, fe- rather than pooled linear regression (see the sketch after point 3 below). And if you do that, by accounting for the fixed effects, you may find the residual distribution becomes closer to normal.

            3. Going far out on a limb, because I know so little about this subject matter, I would think that investment spending could have a very wild and heavy-tailed distribution, even conditional on all the variables in your model. While Size and Profitability might account for a fair amount of the variation, I would still expect a pretty complicated distribution of investment spending even after accounting for those factors. So another thought is to explore transformations of Size or Profitability that might capture the relationship of those attributes to investment spending better than entering them linearly. In addition to making your model a better description of the real world, that approach, too, might leave you with a less disturbing residual distribution.
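
            A minimal sketch of points 2 and 3 combined (the names Investment, firm_id, and year are hypothetical; they do not appear in the thread):

            Code:
            * hypothetical names: firm_id and year identify the panel, and
            * Investment stands in for the dependent variable missing from #5
            xtset firm_id year
            * one possible transformation for a heavy-tailed size measure
            generate ln_Size = ln(Size)
            * firm fixed effects with cluster-robust standard errors; note that
            * Tech_Industry is time-invariant within firm and will be absorbed
            * by the fixed effects, while the interaction remains identified
            xtreg Investment CR_POM Tech_Industry Int_CR_POM_Tech ln_Size Profitability, fe vce(cluster firm_id)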

            Summarizing my thoughts, in this case I think that the residual distribution problem is not the primary problem, but rather that it is a symptom of a mis-specified model. There are probably other ways to handle this besides what I have alluded to above. I think you would actually get better advice from a good econometrician than I can give you. Hopefully one of the many who follow this Forum will chime in.

            • #7
              Hello Elizabete and Clyde,

              The regression model does not require normality. It is commonly believed that it does, but that is not true. If you want the best linear unbiased estimator (BLUE) properties, you do. But if you want a consistent estimator with correct coverage, you are fine even if the unobserved random components of your model are not normal. In fact, all you need is:

              \begin{eqnarray}
              y &=& X\beta + \varepsilon \\
              E\left[\varepsilon|X\right] &=& 0
              \end{eqnarray}

              As long as the unobserved component of your model is mean independent of the observables (regressors), you are fine. I will illustrate this using an example where the unobserved component is a mean zero chi-squared distribution with one degree of freedom:

              Code:
              . clear
              
              . set seed 111
              
              . set obs 1000
              number of observations (_N) was 0, now 1,000
              
              . // Generating regressors and unobserved components
              . generate e  = rchi2(1) - 1
              
              . generate x1 = rchi2(1)
              
              . generate x2 = rbeta(2,3)
              
              . // Generating dependent variable
              . generate y = 3+ 3*x1-3*x2 + e
              
              . // Fitting regression
              . regress y x1 x2
              
                    Source |       SS           df       MS      Number of obs   =     1,000
              -------------+----------------------------------   F(2, 997)       =   4040.77
                     Model |  18497.9922         2  9248.99612   Prob > F        =    0.0000
                  Residual |  2282.05056       997  2.28891731   R-squared       =    0.8902
              -------------+----------------------------------   Adj R-squared   =    0.8900
                     Total |  20780.0428       999  20.8008436   Root MSE        =    1.5129
              
              ------------------------------------------------------------------------------
                         y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
              -------------+----------------------------------------------------------------
                        x1 |   3.045575   .0343296    88.72   0.000     2.978208    3.112942
                        x2 |   -3.13332   .2448836   -12.80   0.000    -3.613866   -2.652773
                     _cons |    3.08467   .1160442    26.58   0.000     2.856951    3.312389
              ------------------------------------------------------------------------------
              Notice that the estimates are close to the true parameter values (3 for x1, -3 for x2, and 3 for the constant) and that each true value lies inside its confidence interval.

              This works because the mathematical conditions imply that:

              \begin{eqnarray}
              y &=& X\beta + \varepsilon \\
              E\left[\varepsilon|X\right] &=& 0 \\
              E\left[y|X\right] &=& X\beta
              \end{eqnarray}

              In other words, if the unobservables are mean independent of the regressors, the conditional mean of y is just a linear combination of the regressors, and that conditional mean is exactly what regression estimates.
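
              As an illustrative follow-up (not part of the original example), the residuals from this simulated regression would themselves fail the tests from #1, even though the inference above is sound:

              Code:
              * the error term is skewed by construction, so the formal
              * normality tests should reject strongly here as well
              predict double resid, residuals
              sktest resid
              swilk resid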
              Last edited by Enrique Pinzon (StataCorp); 05 Jun 2015, 16:18.

              • #8
                Dear All,

                I think there is a small mistake in Enrique's reply above: normality is not even needed for OLS to be BLUE; by the Gauss-Markov theorem, uncorrelated homoskedastic errors suffice. Normality is only needed to perform exact inference, and with large enough samples it is not needed at all.
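
                To illustrate (a sketch, not from the thread; the program name simreg is made up), a quick Monte Carlo check of the coverage of the nominal 95% confidence interval under the same skewed errors as in Enrique's example:

                Code:
                * Monte Carlo coverage check of the 95% CI for the slope
                * under demeaned chi-squared(1) errors, as in #7
                capture program drop simreg
                program define simreg, rclass
                    drop _all
                    set obs 1000
                    generate e  = rchi2(1) - 1
                    generate x1 = rchi2(1)
                    generate y  = 3 + 3*x1 + e
                    regress y x1
                    local hw = invttail(e(df_r), 0.025)*_se[x1]
                    return scalar covered = inrange(3, _b[x1] - `hw', _b[x1] + `hw')
                end
                simulate covered=r(covered), reps(1000) seed(111): simreg
                summarize covered    // mean should be close to the nominal 0.95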

                All the best,

                Joao

                • #9
                  Thank you, Enrique and Joao. The gist of what I was thinking here started from Elizabete's query about normality. From that, my first thought was that there might be a problem with (exact) inference, so I spoke at first to that issue, suggesting that the non-normality might be mild enough to forget about. The -qnorm- graph suggested to me that the non-normality was fairly severe. So my next concern was whether her model was likely to support nearly-exact inference even so. The sample size of ~2,500 struck me as borderline in that regard; whether it suffices might depend on the specifics of the model. So I asked for more details about her model.

                  Seeing the model and thinking about it a bit, it struck me that the outcome variable and the specification of the covariates were likely to lead to an unusual residual distribution, and my intuition is that the model is, in any case, mis-specified. I'm no econometrician, to be sure, but real-world experience suggested to me that investment expenses would not likely be a linear function of firm size and profitability. So at that point I was no longer really thinking about normality as the issue: exact inference from a mis-specified model doesn't mean very much! I also noticed that a pooled regression was being carried out on what was likely to be panel data, which could be another source of bias as well as another cause of an unusual residual distribution. So by that point, I was basically trying to direct Elizabete away from thinking about normality and toward dealing with these other issues. Re-reading my posts, I'm not sure I made my thinking clear.

                  All of that said, it is really nice to see the criteria for consistency, BLUE, and exact inference spelled out clearly in one place (even though I think Elizabete has bigger fish to fry at the moment). Thanks for that.

                  • #10
                    Dear All,

                    I fully subscribe to Clyde's view: Elizabete needs to focus on the specification of the model rather than worry about normality.

                    For those of us who teach regression, maybe we should think again about the way we discuss normality. As Enrique noted, there is a widespread misconception that OLS somehow requires normal errors. With normal errors we can perform exact inference and OLS is efficient in the Cramér-Rao sense (because it is the MLE), but most practitioners are happy with much less than that.

                    All the best,

                    Joao

                    • #11
                      Thank you all for elaborating on the topic. I see your point regarding my model, and I agree that improvements should be made.
