Normality of residuals and heteroskedasticity

Warner de Jong

Join Date: May 2017

Posts: 19
#1

29 May 2017, 13:50

Dear forum,

I am checking the assumptions for a multiple regression model. The dependent variable is continuous; the independent variables are a mix of continuous and dummy variables. Concerning the assumptions, I have already checked for outliers, yet I am having difficulty with the others. Perhaps I should use a different model?

When checking for homoskedasticity using the "estat hettest" and "estat imtest, white" commands, I got very different results. The hettest indicates that heteroskedasticity is present, whereas imtest, white does not. The results leave me confused about how to continue with my model. Furthermore, I checked the normality of the residuals using sktest and found that my residuals are not normally distributed either. The dependent variable is, however, close to a normal distribution, if that helps.

Thank you for your time,
Warner

. estat hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of Post_ROA

chi2(1) = 13.94
Prob > chi2 = 0.0002

. estat imtest, white

White's test for Ho: homoskedasticity
against Ha: unrestricted heteroskedasticity

chi2(20) = 20.23
Prob > chi2 = 0.4439

Cameron & Trivedi's decomposition of IM-test

---------------------------------------------------
              Source |       chi2     df      p
---------------------+-----------------------------
  Heteroskedasticity |      20.23     20    0.4439
            Skewness |       8.98      5    0.1099
            Kurtosis |       5.87      1    0.0154
---------------------+-----------------------------
               Total |      35.08     26    0.1100
---------------------------------------------------
Nick Cox

Join Date: Mar 2014

Posts: 35811
#2

29 May 2017, 14:56

Looking at residuals is of some use but you're asking for advice on a model that you don't show us. Is it a plain regression or something else? What is the sample size and how many parameters are you estimating? Do all the predictors look good?

It is unfortunate that many texts and courses seem to counsel a focus on such tests, when whether Y = Xb is a suitable structure is the most important question of all, and residual plots, including added variable plots, are the most valuable diagnostics.
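
As a minimal sketch of the residual plots mentioned above: the auto dataset shipped with Stata stands in for the data in this thread, and price, mpg, weight and foreign are placeholder variables, not the model under discussion.

```stata
* Illustration only: auto.dta stands in for the poster's data
sysuse auto, clear
regress price mpg weight foreign

* Residuals against fitted values: look for curvature or a funnel shape
rvfplot, yline(0)

* Added-variable plots, one panel per predictor
avplots

* Residuals against one predictor, to check linearity in that predictor
rvpplot weight, yline(0)
```

All three plot commands are run after -regress- and use the most recent fit, so they can be repeated after each change to the model.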
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17749
#3

29 May 2017, 23:55

Warner:
I do share Nick's comments, and I would add that it is also unfortunate that most courses stress normality of the dependent variable as a prerequisite.
As an aside, the two postestimation tests you reported use different numbers of parameters (chi2(1) versus chi2(20)): this can explain why you got different results.
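
A rough sketch of where the different degrees of freedom come from, again with the shipped auto dataset as a stand-in (the variables are placeholders). By default -estat hettest- tests against the fitted values only, which gives one degree of freedom, while White's test uses the regressors, their squares and their cross-products.

```stata
* Illustration only: auto.dta stands in for the poster's data
sysuse auto, clear
regress price mpg weight foreign

* Breusch-Pagan / Cook-Weisberg against the fitted values only
estat hettest
display "hettest df = " r(df)

* Same test against all right-hand-side variables
estat hettest, rhs
display "hettest, rhs df = " r(df)

* Variant that does not assume normally distributed errors
estat hettest, iid

* White's test: regressors, squares and cross-products
estat imtest, white
```

Comparing -estat hettest, rhs- with -estat imtest, white- puts the two tests on a more similar footing than the defaults do.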
Finally, if the results of a visual inspection of your residual distribution worry you, you can robustify your standard errors and go on with -regress-.
Heteroskedasticity per se is seldom a worrisome nuisance; instead, you should rule out that heteroskedasticity is a warning light for omitted variable bias (or better, non-linearity of the relationship between a given predictor and the dependent variable), which is far more catastrophic for your estimates (e.g., endogeneity).
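
A minimal sketch of what robust standard errors change, using the shipped auto dataset as a stand-in (placeholder variables). The coefficients are identical in both fits; only the standard errors, and hence the tests and confidence intervals, differ. -estat ovtest- is included as one rough check on functional form, not as a definitive test for omitted variables.

```stata
* Illustration only: auto.dta stands in for the poster's data
sysuse auto, clear

* Conventional standard errors
regress price mpg weight foreign
estimates store ols

* Ramsey RESET as a rough check for neglected non-linearity
estat ovtest

* Same model with heteroskedasticity-robust standard errors
regress price mpg weight foreign, vce(robust)
estimates store robust

* Side by side: same coefficients, different standard errors
estimates table ols robust, b(%9.3f) se(%9.3f)
```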

Kind regards,
Carlo
(Stata 19.0)
Warner de Jong

Join Date: May 2017

Posts: 19
#4

30 May 2017, 03:29

Dear Nick and Carlo,

My apologies for not defining my model properly. I'll try to do so now.

I compiled a dataset of around 130 CEO successions at about 120 different companies. I want to test the relationship between CEO type and post-succession ROA, moderated by board composition. In short, a moderation relationship.

DV: post-succession ROA (continuous) as a 3-year average
IV1: CEO type (nominal with 3 types)
IV2: Board composition (nominal with 2 types)

I have several other control variables:
- if the previous CEO was also chairman (dummy)
- if the previous CEO is chairman now (dummy)
- if the current CEO is chairman (dummy)
- Year (nominal)
- Industry SIC 2-digit (nominal)
- Board size (continuous)
- pre-succession ROA (continuous) as a 3-year average
- Industry ROA (continuous) as a 3-year average

The only reason I thought that normality of residuals would be important is because I am testing a variety of hypotheses. I read that without normality of residuals I cannot do hypothesis testing. As such, I am looking for a solution.

About the hypotheses: the first set looks at whether each CEO type (3 types) affects firm performance (ROA).
The second set looks at the interaction effect of board composition with each CEO type on firm performance (ROA).

Up till now, I tried a formula where first all the continuous control variables are put into the equation, then the nominal controls, then each IV is added separately to the equation (to infer their effects) and lastly the interaction variables. As a whole, it looks like this:

1. reg Post_ROA c.Pre_ROA c.Boardsize c.Logsales c..Industry_ROA
2. reg Post_ROA c.Pre_ROA c.Boardsize c.Logsales c..Industry_ROA i.PrevCEOischair i.PrevCEOduality i.CEOduality i.year i.SIC2
3. reg Post_ROA c.Pre_ROA c.Boardsize c.Logsales c..Industry_ROA i.PrevCEOischair i.PrevCEOduality i.CEOduality i.year i.SIC2 i.CTYPE
4. reg Post_ROA c.Pre_ROA c.Boardsize c.Logsales c..Industry_ROA i.PrevCEOischair i.PrevCEOduality i.CEOduality i.year i.SIC2 i.CTYPE i.BTYPE
5. reg Post_ROA c.Pre_ROA c.Boardsize c.Logsales c..Industry_ROA i.PrevCEOischair i.PrevCEOduality i.CEOduality i.year i.SIC2 i.CTYPE i.BTYPE i.CTYPE#i.BTYPE
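
A hedged sketch of how the interaction in a model like 5 might be tested jointly. The nlsw88 dataset shipped with Stata stands in: race (three categories) plays the role of CTYPE and union (two categories) the role of BTYPE. That mapping is an assumption made purely for illustration.

```stata
* Illustration only: nlsw88.dta stands in for the CEO succession data
sysuse nlsw88, clear

* i.race##i.union expands to i.race i.union i.race#i.union
regress wage c.tenure c.ttl_exp i.race##i.union

* Joint test of all interaction terms
testparm i.race#i.union

* Joint test of the dummies for the three-category factor
testparm i.race

* Predicted outcome for each combination of the two factors
margins race#union
```

With an interaction in the model, the coefficients on i.race are effects at the base level of union only, which is why joint tests and -margins- are often easier to read than single coefficients.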

I entered the continuous variables first because, when checking the linearity assumption, I only needed to look at the continuous variables: linearity is automatically fulfilled for nominal variables. At least, I read that this was the case.

Hope this helps to clarify.
Warner de Jong

Join Date: May 2017

Posts: 19
#5

30 May 2017, 03:46

Originally posted by Nick Cox:

Do all the predictors look good?

Does this mean that my variables correlate significantly? I do not fully understand what makes a good predictor.

Originally posted by Carlo Lazzaro:

if the results of a visual inspection of your residual distribution worry you, you can robustify your standard errors and go on with -regress-.

I looked at an rvfplot of the residuals and see that they follow a certain band distribution. So, because of the result of the hettest, should I now use robust standard errors? Does this mean that I can still test my hypotheses? If not, should I change to a different statistical model?

Originally posted by Carlo Lazzaro:

you should rule out that heteroskedasticity is a warning light for omitted variable bias (or better, non-linearity of the relationship between a given predictor and the dependent variable), which is far more catastrophic for your estimates (e.g., endogeneity).

If I understand correctly, omitted variable bias is more catastrophic for my estimates? I ask because board composition has an endogeneity problem: boards influence the performance of a firm (ROA), and the performance of a firm influences the future composition of the board.

Last edited by Warner de Jong; 30 May 2017, 03:53.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17749
#6

30 May 2017, 04:00

Warner:
my previous remark about the pointlessness of normality referred to the -depvar- (as you stated that "The dependent variable is however close to a normal distribution").
Conversely, it is wise to inspect the residual distribution visually.
I do not follow your approach of including continuous variables first and then going on as you described.
Regression models should give a fair and true view of the data generating process underlying the population from which your sample has been drawn. Conversely, you seem (as is often the case) to be hunting for the "best" (whatever that means) regression model. Rather, I would recommend that you look at the literature in your research field and see what others did in the past when presented with the same research topic.
Please note that the effect of each predictor is adjusted for the other ones; put differently, it is hard to disentangle the effect of each independent variable precisely. That said, you can consider adjusted R-squared to avoid including inefficient predictors.
Finally (with the caveat that this is not my research field), I would check whether your regression models suffer from endogeneity: for instance, does CEO ability (in doing business, creating relationships or the like) influence, at the same time, -CEO_type- and/or -Log_sales- and -Post_ROA- (I assume that ROA stands for return on (net) activities, if what I learnt in the past millennium still holds)?
As a coding-related aside, -c..Industry_ROA- should be -c.Industry_ROA-.

PS: crossed in the cyberspace with Warner's reply, who is wisely wondering whether his regression models suffer from endogeneity.

Last edited by Carlo Lazzaro; 30 May 2017, 04:02.

Kind regards,
Carlo
(Stata 19.0)
Warner de Jong

Join Date: May 2017

Posts: 19
#7

30 May 2017, 05:39

Dear Carlo,

It might be true that I'm hunting for the best regression model. I'm following the structure of a paper that also uses CEO type, but with a different interaction variable. The authors used hierarchical multiple linear regression: they first estimated a control model and then added models that included the IVs and the interaction. Seeing as I'm still learning about statistics and this paper has been cited very often, I opted to copy their hierarchical MLR approach. What I am really looking for is a table with one model per column, showing the sign, size and significance of the variables in each regression. I would then assess the models based on their adjusted R-squared, AIC and BIC, and move on to rejecting or not rejecting my hypotheses.
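
A minimal sketch of such a table, assuming blockwise entry of predictors fitted with plain -regress-. The shipped nlsw88 dataset and its variables are stand-ins, and the stored names (controls, plus_race and so on) are hypothetical. All models are restricted to one common sample, since AIC and BIC are only comparable across models fitted to the same observations.

```stata
* Illustration only: nlsw88.dta stands in for the CEO succession data
sysuse nlsw88, clear

* Fix one estimation sample so that all models use the same observations
quietly regress wage c.tenure c.ttl_exp i.race i.union
generate byte touse = e(sample)

quietly regress wage c.tenure c.ttl_exp if touse
estimates store controls

quietly regress wage c.tenure c.ttl_exp i.race if touse
estimates store plus_race

quietly regress wage c.tenure c.ttl_exp i.race i.union if touse
estimates store plus_union

quietly regress wage c.tenure c.ttl_exp i.race##i.union if touse
estimates store interact

* One column per model: coefficients with stars, plus fit statistics
estimates table controls plus_race plus_union interact, ///
    b(%9.3f) star stats(N r2_a aic bic)
```

The -star- option cannot be combined with -se-, so a second call with se(%9.3f) in place of star would be needed to list standard errors.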

ROA means return on assets. It is an operational performance indicator calculated as the net income of a firm divided by its total assets.

By the dependent variable being close to a normal distribution, I mean that the histogram of Post_ROA looks bell-shaped yet has a small tail to its left.

Concerning the residuals I would post an image of the graphs but I think that's not allowed right? (edit: please see below) I regress:

5. reg Post_ROA c.Pre_ROA c.Boardsize c.Logsales c..Industry_ROA i.PrevCEOischair i.PrevCEOduality i.CEOduality i.year i.SIC2 i.CTYPE i.BTYPE i.CTYPE#i.BTYPE

I get the following:

. sktest r

                    Skewness/Kurtosis tests for Normality
                                                          ------ joint ------
    Variable |        Obs  Pr(Skewness)  Pr(Kurtosis) adj chi2(2)   Prob>chi2
-------------+---------------------------------------------------------------
           r |        120     0.0035        0.0064       13.23         0.0013

. swilk r

                   Shapiro-Wilk W test for normal data

    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
           r |        120    0.94935      4.874     3.548    0.00019

Because the Shapiro-Wilk test (swilk r) gives a probability of 0.00019 and my sktest gives 0.0013, I conclude that my residuals are not normally distributed.

Furthermore, if I plot my DV against the residuals of the model, I get a thick diagonal line which starts at the bottom left and moves to the bottom right. Does this imply anything about my model? I used the following command:

. scatter Post_ROA r
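
A hedged sketch of why a plot of the DV against the residuals tends to show a diagonal band regardless of model quality, and of plots that may be more informative. The shipped auto dataset stands in and the variables are placeholders. Because the outcome equals fitted value plus residual, in a least-squares fit with an intercept the correlation between outcome and residual equals sqrt(1 - R-squared), which is positive by construction.

```stata
* Illustration only: auto.dta stands in for the poster's data
sysuse auto, clear
regress price mpg weight foreign
predict double r, residuals
predict double yhat, xb

* Outcome and residual are positively correlated by construction
correlate price r
display "sqrt(1 - R2) = " sqrt(1 - e(r2))

* More informative: residuals against fitted values
scatter r yhat, yline(0)

* Normal quantile plot and histogram of the residuals
qnorm r
histogram r, normal
```

The quantile plot shows where the residuals depart from normality (tails or centre), which a single p-value from -sktest- or -swilk- does not.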

Last edited by Warner de Jong; 30 May 2017, 06:09.
Nick Cox

Join Date: Mar 2014

Posts: 35811
#8

30 May 2017, 05:46

Graphs may and should be shown as .png attachments.
Warner de Jong

Join Date: May 2017

Posts: 19
#9

30 May 2017, 05:57

Thank you Nick.

scatter Post_ROA r

Last edited by Warner de Jong; 30 May 2017, 06:11.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17749
#10

30 May 2017, 06:56

Warner:
- you're right to correct me about ROA (-attività- is the Italian word for -assets- and I stumbled upon a bad translation);
- you're talking about a hierarchical model. This is something you cannot achieve via -regress-; see -mixed- instead;
- I do not see any substantive pattern in your residuals, apart from a nasty behaviour of the distribution around the tails.

Kind regards,
Carlo
(Stata 19.0)
Warner de Jong

Join Date: May 2017

Posts: 19
#11

30 May 2017, 07:06

Carlo:

Could you tell me what -mixed- means, and why I should use it instead of -regress-?

Do you mean it like this?
. mixed Post_ROA Pre_ROA Boardsize LSales Ind_ROAmed i.PC_chairman i.PC_duality i.Duality i.Year i.SIC_2 i.CEOtype i.B_insiderelated i.CEOtype#i.B_insiderelated

This equation gives me the following:
Mixed-effects ML regression Number of obs = 120

Wald chi2(50) = 128.22
Log likelihood = 66.653405 Prob > chi2 = 0.0000

-----------------------------------------------------------------------------------------
Post_ROA | Coef. Std. Err. z P>|z| [95% Conf. Interval]
------------------------+----------------------------------------------------------------
Pre_ROA | .14626 .081553 1.79 0.073 -.013581 .306101
Boardsize | .0117652 .0092706 1.27 0.204 -.0064049 .0299353
LSales | .0145462 .0125622 1.16 0.247 -.0100753 .0391676
Ind_ROAmed | 3.112505 .9180091 3.39 0.001 1.31324 4.91177
1.PC_chairman | .0690523 .0403378 1.71 0.087 -.0100083 .1481129
1.PC_duality | -.0490561 .0360333 -1.36 0.173 -.11968 .0215679
1.Duality | .0558277 .0529226 1.05 0.291 -.0478986 .1595541
|
Year |
2006 | .0544576 .1173241 0.46 0.643 -.1754933 .2844085
2007 | -.0610787 .06111 -1.00 0.318 -.1808521 .0586948
2008 | -.094864 .0467463 -2.03 0.042 -.1864852 -.0032429
2009 | -.143529 .0585118 -2.45 0.014 -.25821 -.028848
|
SIC_2 |
10 | .0561901 .1898293 0.30 0.767 -.3158686 .4282487
13 | -.0201768 .1675557 -0.12 0.904 -.34858 .3082264
20 | -.1821011 .1467214 -1.24 0.215 -.4696698 .1054675
23 | -.3161692 .1953444 -1.62 0.106 -.6990372 .0666989
24 | .1056709 .1814497 0.58 0.560 -.249964 .4613057
25 | -.0633072 .183314 -0.35 0.730 -.422596 .2959816
27 | .0171327 .1970672 0.09 0.931 -.369112 .4033773
28 | .11118 .1171346 0.95 0.343 -.1183995 .3407596
30 | -.3326038 .1525763 -2.18 0.029 -.6316479 -.0335597
34 | -.2628424 .1897255 -1.39 0.166 -.6346975 .1090126
35 | -.7628842 .2835575 -2.69 0.007 -1.318647 -.2071218
36 | -.0483034 .1126936 -0.43 0.668 -.2691788 .172572
37 | -.4590121 .1906657 -2.41 0.016 -.8327101 -.0853141
38 | .1852497 .1236749 1.50 0.134 -.0571487 .427648
42 | -.037945 .2037756 -0.19 0.852 -.4373378 .3614478
49 | -.1105019 .1894212 -0.58 0.560 -.4817607 .2607568
50 | -.1612987 .1739662 -0.93 0.354 -.5022661 .1796687
51 | -.0910764 .1614921 -0.56 0.573 -.4075951 .2254422
53 | -.2060143 .2092402 -0.98 0.325 -.6161176 .2040889
55 | -.027863 .1595689 -0.17 0.861 -.3406124 .2848863
56 | -.1685025 .1526428 -1.10 0.270 -.4676769 .130672
58 | -.116458 .1423203 -0.82 0.413 -.3954006 .1624847
59 | -.3142453 .1636384 -1.92 0.055 -.6349707 .0064801
61 | -.4132618 .2028709 -2.04 0.042 -.8108815 -.0156421
62 | .3634348 .1639658 2.22 0.027 .0420677 .6848018
63 | -.4088123 .2020512 -2.02 0.043 -.8048253 -.0127992
64 | -.0251544 .2110591 -0.12 0.905 -.4388227 .3885139
67 | -.0908212 .1471505 -0.62 0.537 -.3792309 .1975886
72 | .3158626 .159415 1.98 0.048 .0034149 .6283104
73 | -.0463639 .1198774 -0.39 0.699 -.2813192 .1885915
79 | -.1031701 .1389367 -0.74 0.458 -.3754811 .1691408
82 | -.147231 .1853905 -0.79 0.427 -.5105897 .2161277
83 | -.0855206 .1739047 -0.49 0.623 -.4263675 .2553262
87 | .1746057 .1618539 1.08 0.281 -.1426221 .4918334
|
CEOtype |
2 | .0809773 .0602432 1.34 0.179 -.0370973 .1990519
3 | -.0541212 .041248 -1.31 0.189 -.1349658 .0267233
|
1.B_insiderelated | -.0714629 .0574916 -1.24 0.214 -.1841443 .0412186
|
CEOtype#B_insiderelated |
2 1 | -.1494442 .1111639 -1.34 0.179 -.3673215 .0684331
3 1 | .1101797 .0940301 1.17 0.241 -.074116 .2944754
|
_cons | -.3704733 .2018844 -1.84 0.066 -.7661594 .0252129
-----------------------------------------------------------------------------------------

------------------------------------------------------------------------------
Random-effects Parameters | Estimate Std. Err. [95% Conf. Interval]
-----------------------------+------------------------------------------------
var(Residual) | .0192785 .0024888 .0149686 .0248292
------------------------------------------------------------------------------

Last edited by Warner de Jong; 30 May 2017, 07:08.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17749
#12

30 May 2017, 07:18

Warner:
-mixed- (see -help mixed-) stands for linear mixed models (mixed model is a synonym for hierarchical model).
In brief, those models combine a first-level fixed effect (which you can estimate via -regress-) with a second-level random effect, exploiting the nested structure of your data (in your example, firms are nested in industries).
In your case you would have a fixed effect at the firm level and a random effect at the industry level.
This approach allows each industry to have its own random intercept and, possibly, a random slope, too.
This cannot be done with -regress-, which in general allows for one intercept only (even though creating different intercepts and slopes is feasible under -regress-) and cannot take second-level variance (i.e., the random effect) into account.
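
A minimal sketch of the syntax, on simulated data so that it is self-contained. The variables industry, x, y and u are hypothetical and not taken from the dataset in this thread. The random-effects part of -mixed- is whatever follows ||. The command shown in #11 has no || part, which matches its output listing only var(Residual): it fits the fixed-effects part alone.

```stata
* Simulated illustration: 30 hypothetical industries with 10 firms each
clear
set seed 12345
set obs 30
generate industry = _n
generate u = rnormal(0, 0.5)     // industry-specific random intercept
expand 10
generate x = rnormal()
generate y = 1 + 0.5*x + u + rnormal()

* Fixed-effects part only: no || term, so no random effects are estimated
mixed y x

* Random intercept for each industry
mixed y x || industry:

* Intraclass correlation: share of variance attributable to industries
estat icc
```

A random slope on x would be requested by listing it after the colon, as in || industry: x.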

Kind regards,
Carlo
(Stata 19.0)
Warner de Jong

Join Date: May 2017

Posts: 19
#13

30 May 2017, 08:03

Dear Carlo,

I'm sorry, but I do not understand what to do now. At least, I think I don't. Most of all, I am still uncertain about whether I am actually using the right model.

[EDIT: I edited this post to better formulate the issues.]

Perhaps this is the wrong question, but should I use a hierarchical regression model? Or perhaps a 2-way ANOVA? Or should I provide more information about the nature of my predictors, and if so, what could that be?

My second question is what to do when normality of residuals is violated, since I understand that I then cannot use my results to test my hypotheses. Or does this differ for the hierarchical or mixed model?

My third question is: judging from the graphs and tests that I provided, do I violate the normality assumption? Could I proceed, or should I switch to another model?

PS: The help is very much appreciated, even though I keep asking more questions.

Last edited by Warner de Jong; 30 May 2017, 08:27.
Warner de Jong

Join Date: May 2017

Posts: 19
#14

30 May 2017, 08:29

As for your example, perhaps this could help

Originally posted by Carlo Lazzaro:

Warner:
In your case you would have a fixed effect at the firm level and a random effect at the industry level.

CEOs and boards are nested in firms. Firms are nested in industries.

EDIT: I have just found a link that helps me in understanding the -mixed- option. The visuals help clarify. I'm going to leave the link here in case it will help future students. http://blog.stata.com/2013/02/04/mul...s-of-variance/

Last edited by Warner de Jong; 30 May 2017, 08:54.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17749
#15

30 May 2017, 08:56

Warner:
things are trickier than expected, then: see Example 4, -mixed- entry, Stata .pdf manual.
As things stand, you cannot replace a hierarchical model with OLS, as your estimates would be biased.
Two remarks, technicalities aside:
- be sure that you have enough time to grasp at least the backbone of the -mixed- model (Stata would be a good place to start);
- discuss and fine-tune the goal of your research with your teacher/supervisor/professor (who is paid for that). Statistical analyses can be really tricky to perform, and it is easy to end up with wrong results that, even in technical journals, are traded for gold when they are, at best, similar to copper.

Kind regards,
Carlo
(Stata 19.0)