  • Normality of residuals and heteroskedasticity

    Dear forum,

    I am checking the assumptions for a multiple regression model. The dependent variable is continuous; the independent variables are a mix of continuous and dummy variables. I have already checked for outliers, yet I am having difficulty with the other assumptions. Perhaps I should use a different model?

    When checking for homoskedasticity using the "estat hettest" and "estat imtest, white" commands, I got very different results. The hettest indicates that heteroskedasticity is present, whereas the imtest, white does not. The results leave me unsure how to continue with my model. Furthermore, I checked the normality of the residuals using sktest and found that my residuals are not normally distributed either. The dependent variable is, however, close to a normal distribution, if that helps.

    Thank you for your time,
    Warner

    . estat hettest

    Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
    Ho: Constant variance
    Variables: fitted values of Post_ROA

    chi2(1) = 13.94
    Prob > chi2 = 0.0002

    . estat imtest, white

    White's test for Ho: homoskedasticity
    against Ha: unrestricted heteroskedasticity

    chi2(20) = 20.23
    Prob > chi2 = 0.4439

    Cameron & Trivedi's decomposition of IM-test

    ---------------------------------------------------
                  Source |       chi2     df      p
    ---------------------+-----------------------------
      Heteroskedasticity |      20.23     20    0.4439
                Skewness |       8.98      5    0.1099
                Kurtosis |       5.87      1    0.0154
    ---------------------+-----------------------------
                   Total |      35.08     26    0.1100
    ---------------------------------------------------


  • #2
    Looking at residuals is of some use but you're asking for advice on a model that you don't show us. Is it a plain regression or something else? What is the sample size and how many parameters are you estimating? Do all the predictors look good?

    It is unfortunate that many texts and courses seem to counsel focus on such tests when whether Y = Xb is a suitable structure is the most important question of all and residual plots, including added variable plots, the most valuable diagnostics.
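
A minimal sketch of the residual and added-variable plots mentioned above, assuming Warner's data are in memory. The variable names are those of model 1 in post #4 of this thread; only the continuous predictors are used, to keep the sketch short.

```stata
* Assumption: Warner's dataset is loaded; names follow model 1 in post #4.
regress Post_ROA Pre_ROA Boardsize Logsales Industry_ROA

rvfplot, yline(0)      // residuals against fitted values
rvpplot Pre_ROA        // residuals against one predictor
avplots                // added-variable plots, one panel per predictor
```

Curvature or a funnel shape in these plots would point to a problem with the Y = Xb structure rather than with the tests themselves.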



    • #3
      Warner:
      I share Nick's comments, and I would add that it is also unfortunate that most courses stress normality of the dependent variable as a prerequisite.
      As an aside, the postestimation tests you reported use different numbers of parameters (1 versus 20 degrees of freedom in your output); this can explain why you got different results.
      Finally, if a visual inspection of your residual distribution worries you, you can robustify your standard errors and go on with -regress-.
      Heteroskedasticity per se is seldom a worrisome nuisance: instead, you should rule out that heteroskedasticity is a warning light for omitted variable bias (or better, non-linearity of the relationship between a given predictor and the dependent variable), which is far more damaging for your estimates (e.g., endogeneity).
      Kind regards,
      Carlo
      (Stata 19.0)
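
A hedged sketch of the two steps suggested in this post: a rough check on functional form, followed by the same regression with robust standard errors. It assumes Warner's data are in memory and uses the variable names from model 5 in post #4; `i.CTYPE##i.BTYPE` is shorthand for both main effects plus their interaction. Ramsey's RESET (`estat ovtest`) is swapped in here as one possible check and is not something the thread itself prescribes.

```stata
* Assumption: Warner's dataset is loaded; names follow model 5 in post #4.
* Run as one block in a do-file so the local macro stays defined.
local rhs c.Pre_ROA c.Boardsize c.Logsales c.Industry_ROA     ///
          i.PrevCEOischair i.PrevCEOduality i.CEOduality       ///
          i.year i.SIC2 i.CTYPE##i.BTYPE

* Conventional fit, then Ramsey's RESET as a rough check on functional form.
regress Post_ROA `rhs'
estat ovtest

* Same coefficients, heteroskedasticity-robust (Huber/White) standard errors.
regress Post_ROA `rhs', vce(robust)
```

Robust standard errors change only the standard errors, tests and confidence intervals; they do nothing about omitted variables or a misspecified functional form.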



      • #4
        Dear Nick and Carlo,

        My apologies for not defining my model properly. I'll try to do so now.

        I compiled a dataset of around 130 CEO successions at about 120 different companies. I want to test the relationship between post-succession ROA and CEO type, moderated by board composition. In short, a moderation relationship.

        DV: post-succession ROA (continuous) as a 3-year average
        IV1: CEO type (nominal with 3 types)
        IV2: Board composition (nominal with 2 types)

        I have several other control variables:
        - if the previous CEO was also chairman (dummy)
        - if the previous CEO is chairman now (dummy)
        - if the current CEO is chairman (dummy)
        - Year (nominal)
        - Industry SIC 2-digit (nominal)
        - Board size (continuous)
        - pre-succession ROA (continuous) as a 3-year average
        - Industry ROA (continuous) as a 3-year average


        The only reason I thought that normality of residuals would be important is because I am testing a variety of hypotheses. I read that without normality of residuals I cannot do hypothesis testing. As such, I am looking for a solution.

        About the hypotheses: the first set looks at whether each CEO type (3 types) affects firm performance (ROA).
        The second set looks at whether board composition moderates the effect of each CEO type on firm performance (ROA).

        Up till now, I tried a formula where first all the continuous control variables are put into the equation, then the nominal controls, then each IV is added separately to the equation (to infer their effects) and lastly the interaction variables. As a whole, it looks like this:

        1. reg Post_ROA c.Pre_ROA c.Boardsize c.Logsales c..Industry_ROA
        2. reg Post_ROA c.Pre_ROA c.Boardsize c.Logsales c..Industry_ROA i.PrevCEOischair i.PrevCEOduality i.CEOduality i.year i.SIC2
        3. reg Post_ROA c.Pre_ROA c.Boardsize c.Logsales c..Industry_ROA i.PrevCEOischair i.PrevCEOduality i.CEOduality i.year i.SIC2 i.CTYPE
        4. reg Post_ROA c.Pre_ROA c.Boardsize c.Logsales c..Industry_ROA i.PrevCEOischair i.PrevCEOduality i.CEOduality i.year i.SIC2 i.CTYPE i.BTYPE
        5. reg Post_ROA c.Pre_ROA c.Boardsize c.Logsales c..Industry_ROA i.PrevCEOischair i.PrevCEOduality i.CEOduality i.year i.SIC2 i.CTYPE i.BTYPE i.CTYPE#i.BTYPE

        I entered the continuous variables first because, when checking the linearity assumption, I only needed to look at the continuous variables; linearity is automatically fulfilled for nominal variables. At least, I read that this was the case.

        Hope this helps to clarify




        • #5
          Originally posted by Nick Cox:
          Do all the predictors look good?
          Does this mean that my variables significantly correlate? I do not fully understand what makes a good predictor.

          Originally posted by Carlo Lazzaro:
          if the results of a visual inspection of your residual distribution worry you, you can robustify your standard error and go on with -regress-.
          I looked at an rvfplot of the residuals and see that they follow a certain band distribution. So, because of the result of the hettest, should I now use robust standard errors? Does this mean that I can still test my hypotheses? If not, should I change to a different statistical model?

          Originally posted by Carlo Lazzaro:
          you should rule out that heteroskedasticity is a warning light for omitted variable bias (or better, non-linearity of the relationship between a given predictor and the dependent variable), which is absolutely more catastrophic for your estimates (eg, endogeneity).
          If I understand correctly, omitted variable bias is more catastrophic for my estimates? I ask because board composition has an endogeneity problem: boards influence the performance of a firm (ROA), and the performance of a firm in turn influences the future composition of the board.
          Last edited by Warner de Jong; 30 May 2017, 03:53.



          • #6
            Warner:
            my previous remark about the pointlessness of normality referred to the -depvar- (as you stated that "The dependent variable is however close to a normal distribution").
            Conversely, it is wise to visually inspect the residual distribution.
            I do not follow your approach of including continuous variables first and then going on as you described.
            Regression models should give a fair and true view of the data generating process underlying the population from which your sample has been drawn. Conversely, you seem (as is often the case) to hunt for the "best" (whatever that means) regression model. Rather, I would recommend that you look at the literature in your research field and see what others did in the past when presented with the same research topic.
            Please note that the effect of each predictor is adjusted for the other ones; put differently, it is hard to disentangle the effect of each independent variable precisely. That said, you can consider adjusted R-squared to avoid including inefficient predictors.
            Finally (with the caveat that this is not my research field), I would check whether your regression models suffer from endogeneity: for instance, does CEO ability (in doing business, creating relationships or the like) influence, at the same time, -CEO_type- and/or -Log_sales- and -Post_ROA- (I assume that ROA stands for return on (net) activities, if what I learnt in the past millennium still holds)?
            As a coding-related aside, -c..Industry_ROA- should be -c.Industry_ROA-.

            PS: crossed in the cyberspace with Warner's reply, who is wisely wondering whether his regression models suffer from endogeneity.
            Last edited by Carlo Lazzaro; 30 May 2017, 04:02.
            Kind regards,
            Carlo
            (Stata 19.0)



            • #7
              Dear Carlo,

              It might be true that I'm hunting for the best regression model. I'm following the structure of a paper that also uses CEO type, yet a different interaction variable. The authors have used hierarchical multiple linear regression. They have first made a control model and then added models that included the IVs and interaction. Seeing as that I'm still learning about statistics and that this paper has been cited very often, I opted for copying their hierarchical MLR approach. What I am really looking for is to have a resultant table that would show me a model per column in which the sign, size and significance of variables in my regression is shown. I would then assess the models based on their adjusted r2, AIC and BIC, and move to refusing or not refusing my hypotheses.
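
A minimal sketch of one way to get the table described above, with one model per column and adjusted R-squared, AIC and BIC underneath. It assumes Warner's data are in memory and uses the variable names from post #4; the stored names `controls`, `main` and `interact` are hypothetical labels chosen for this illustration.

```stata
* Assumption: Warner's dataset is loaded; names follow post #4.
* Run as one block in a do-file so the local macros stay defined.
local cont c.Pre_ROA c.Boardsize c.Logsales c.Industry_ROA
local nom  i.PrevCEOischair i.PrevCEOduality i.CEOduality i.year i.SIC2

quietly regress Post_ROA `cont' `nom'
estimates store controls

quietly regress Post_ROA `cont' `nom' i.CTYPE i.BTYPE
estimates store main

quietly regress Post_ROA `cont' `nom' i.CTYPE i.BTYPE i.CTYPE#i.BTYPE
estimates store interact

* One column per model: coefficients with stars, then N, adjusted R2, AIC, BIC.
estimates table controls main interact, b(%9.3f) star(.05 .01 .001) ///
    stats(N r2_a aic bic)
```

AIC and BIC are only comparable across models fitted to the same observations, so it is worth checking that N is identical in every column.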

              ROA means return on assets. It is an operational performance indicator calculated as the net income of a firm divided by its total assets.

              By the dependent variable being close to a normal distribution, I mean that the Post_ROA histogram looks bell-shaped yet has a small tail to its left.

              Concerning the residuals, I would post an image of the graphs, but I think that's not allowed, right? (edit: please see below) I regress:

              5. reg Post_ROA c.Pre_ROA c.Boardsize c.Logsales c..Industry_ROA i.PrevCEOischair i.PrevCEOduality i.CEOduality i.year i.SIC2 i.CTYPE i.BTYPE i.CTYPE#i.BTYPE

              I get the following:

              . sktest r

              Skewness/Kurtosis tests for Normality
              ------ joint ------
              Variable | Obs Pr(Skewness) Pr(Kurtosis) adj chi2(2) Prob>chi2
              ---------+---------------------------------------------------------------
              r | 120 0.0035 0.0064 13.23 0.0013


              swilk r

              Shapiro-Wilk W test for normal data

              Variable | Obs W V z Prob>z
              ----------------+------------------------------------------------------
              r | 120 0.94935 4.874 3.548 0.00019



              Because the Shapiro-Wilk test (swilk r) gives a probability of 0.00019 and my sktest gives 0.0013, I infer that my residuals are not normally distributed.

              Furthermore, if I plot my DV against the residuals of the model, I get a thick diagonal line which starts at the bottom left and moves to the bottom right. Does this imply anything about my model? I used the following command:

              . scatter Post_ROA r
              Last edited by Warner de Jong; 30 May 2017, 06:09.
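
In an OLS fit with an intercept, the observed outcome is the fitted value plus the residual, so the residuals are positively correlated with the outcome by construction (the correlation equals the square root of 1 minus R-squared). A diagonal band in a plot of Post_ROA against r is therefore expected and says little about the model. The more informative plot puts the residuals against the fitted values. A minimal sketch, assuming model 5 has just been fitted with -regress-; `yhat` and `res` are hypothetical new variable names.

```stata
* Assumption: model 5 from post #4 has just been fitted with -regress-.
* yhat and res are hypothetical names for new variables.
predict double yhat, xb
predict double res, residuals

scatter res yhat, yline(0)     // the usual diagnostic: residuals vs fitted
correlate Post_ROA res         // positive by construction in OLS
```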



              • #8
                Graphs may and should be shown as .png attachments.



                • #9
                  Thank you Nick.
                  [Figure: ksdensity residuals full mode graph.png]
                  [Figure: pnorm residuals full mode graph.png]
                  [Figure: qnorm residuals full mode graph.png]

                  scatter Post_ROA r
                  [Figure: scatter PostROA and residuals full mode Graph.png]
                  Last edited by Warner de Jong; 30 May 2017, 06:11.



                  • #10
                    Warner:
                    - you're right in correcting me about ROA (-attività- is the Italian word for -assets- and I stumbled upon a bad translation);
                    - you're talking about a hierarchical model. This is something you cannot achieve via -regress-; see -mixed- instead;
                    - I do not see any substantive pattern in your residuals, aside from a nasty behaviour of the distribution around the tails.
                    Kind regards,
                    Carlo
                    (Stata 19.0)



                    • #11
                      Carlo:

                      Could you tell me what -mixed- means, and why I should use it instead of -regress-?

                      Do you mean it like this?
                      . mixed Post_ROA Pre_ROA Boardsize LSales Ind_ROAmed i.PC_chairman i.PC_duality i.Duality i.Year i.SIC_2 i.CEOtype i.B_insiderelated i.CEOtype#i.B_insiderelated

                      This equation gives me the following:
                      Mixed-effects ML regression                     Number of obs     =        120

                                                                      Wald chi2(50)     =     128.22
                      Log likelihood =  66.653405                     Prob > chi2       =     0.0000


                      -----------------------------------------------------------------------------------------
                      Post_ROA | Coef. Std. Err. z P>|z| [95% Conf. Interval]
                      ------------------------+----------------------------------------------------------------
                      Pre_ROA | .14626 .081553 1.79 0.073 -.013581 .306101
                      Boardsize | .0117652 .0092706 1.27 0.204 -.0064049 .0299353
                      LSales | .0145462 .0125622 1.16 0.247 -.0100753 .0391676
                      Ind_ROAmed | 3.112505 .9180091 3.39 0.001 1.31324 4.91177
                      1.PC_chairman | .0690523 .0403378 1.71 0.087 -.0100083 .1481129
                      1.PC_duality | -.0490561 .0360333 -1.36 0.173 -.11968 .0215679
                      1.Duality | .0558277 .0529226 1.05 0.291 -.0478986 .1595541
                      |
                      Year |
                      2006 | .0544576 .1173241 0.46 0.643 -.1754933 .2844085
                      2007 | -.0610787 .06111 -1.00 0.318 -.1808521 .0586948
                      2008 | -.094864 .0467463 -2.03 0.042 -.1864852 -.0032429
                      2009 | -.143529 .0585118 -2.45 0.014 -.25821 -.028848
                      |
                      SIC_2 |
                      10 | .0561901 .1898293 0.30 0.767 -.3158686 .4282487
                      13 | -.0201768 .1675557 -0.12 0.904 -.34858 .3082264
                      20 | -.1821011 .1467214 -1.24 0.215 -.4696698 .1054675
                      23 | -.3161692 .1953444 -1.62 0.106 -.6990372 .0666989
                      24 | .1056709 .1814497 0.58 0.560 -.249964 .4613057
                      25 | -.0633072 .183314 -0.35 0.730 -.422596 .2959816
                      27 | .0171327 .1970672 0.09 0.931 -.369112 .4033773
                      28 | .11118 .1171346 0.95 0.343 -.1183995 .3407596
                      30 | -.3326038 .1525763 -2.18 0.029 -.6316479 -.0335597
                      34 | -.2628424 .1897255 -1.39 0.166 -.6346975 .1090126
                      35 | -.7628842 .2835575 -2.69 0.007 -1.318647 -.2071218
                      36 | -.0483034 .1126936 -0.43 0.668 -.2691788 .172572
                      37 | -.4590121 .1906657 -2.41 0.016 -.8327101 -.0853141
                      38 | .1852497 .1236749 1.50 0.134 -.0571487 .427648
                      42 | -.037945 .2037756 -0.19 0.852 -.4373378 .3614478
                      49 | -.1105019 .1894212 -0.58 0.560 -.4817607 .2607568
                      50 | -.1612987 .1739662 -0.93 0.354 -.5022661 .1796687
                      51 | -.0910764 .1614921 -0.56 0.573 -.4075951 .2254422
                      53 | -.2060143 .2092402 -0.98 0.325 -.6161176 .2040889
                      55 | -.027863 .1595689 -0.17 0.861 -.3406124 .2848863
                      56 | -.1685025 .1526428 -1.10 0.270 -.4676769 .130672
                      58 | -.116458 .1423203 -0.82 0.413 -.3954006 .1624847
                      59 | -.3142453 .1636384 -1.92 0.055 -.6349707 .0064801
                      61 | -.4132618 .2028709 -2.04 0.042 -.8108815 -.0156421
                      62 | .3634348 .1639658 2.22 0.027 .0420677 .6848018
                      63 | -.4088123 .2020512 -2.02 0.043 -.8048253 -.0127992
                      64 | -.0251544 .2110591 -0.12 0.905 -.4388227 .3885139
                      67 | -.0908212 .1471505 -0.62 0.537 -.3792309 .1975886
                      72 | .3158626 .159415 1.98 0.048 .0034149 .6283104
                      73 | -.0463639 .1198774 -0.39 0.699 -.2813192 .1885915
                      79 | -.1031701 .1389367 -0.74 0.458 -.3754811 .1691408
                      82 | -.147231 .1853905 -0.79 0.427 -.5105897 .2161277
                      83 | -.0855206 .1739047 -0.49 0.623 -.4263675 .2553262
                      87 | .1746057 .1618539 1.08 0.281 -.1426221 .4918334
                      |
                      CEOtype |
                      2 | .0809773 .0602432 1.34 0.179 -.0370973 .1990519
                      3 | -.0541212 .041248 -1.31 0.189 -.1349658 .0267233
                      |
                      1.B_insiderelated | -.0714629 .0574916 -1.24 0.214 -.1841443 .0412186
                      |
                      CEOtype#B_insiderelated |
                      2 1 | -.1494442 .1111639 -1.34 0.179 -.3673215 .0684331
                      3 1 | .1101797 .0940301 1.17 0.241 -.074116 .2944754
                      |
                      _cons | -.3704733 .2018844 -1.84 0.066 -.7661594 .0252129
                      -----------------------------------------------------------------------------------------

                      ------------------------------------------------------------------------------
                      Random-effects Parameters | Estimate Std. Err. [95% Conf. Interval]
                      -----------------------------+------------------------------------------------
                      var(Residual) | .0192785 .0024888 .0149686 .0248292
                      ------------------------------------------------------------------------------


                      Last edited by Warner de Jong; 30 May 2017, 07:08.



                      • #12
                        Warner:
                        -mixed- (see -help mixed-) stands for linear mixed models (mixed model is a synonym for hierarchical model).
                        In brief, those models combine a first-level fixed effect (that you can estimate via -regress-) with a second-level random effect, exploiting the nested structure of your data (in your example, firms are nested in industries).
                        In your case you would have a fixed effect at the firm level and a random effect at the industry level.
                        This approach allows each industry to have its own random intercept and, possibly, a random slope, too.
                        This cannot be done with -regress-, which, in general, allows for one intercept only (even though creating different intercepts and slopes is feasible under -regress-) and cannot take second-level variance (i.e., the random effect) into account.
                        Kind regards,
                        Carlo
                        (Stata 19.0)
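
The -mixed- call in post #11 has no `||` part, so it specifies no random-effects equation; that is why its output lists only var(Residual). A hedged sketch of a random intercept for industry follows, assuming Warner's data are in memory, that SIC_2 identifies industries, and the variable names from post #11. Moving SIC_2 from the fixed part to the grouping level is one possible choice for illustration, not something the thread confirms as the right specification.

```stata
* Assumption: Warner's dataset is loaded; names follow post #11.
* SIC_2 is used as the grouping variable instead of as i.SIC_2 dummies.
mixed Post_ROA Pre_ROA Boardsize LSales Ind_ROAmed           ///
      i.PC_chairman i.PC_duality i.Duality i.Year             ///
      i.CEOtype##i.B_insiderelated                            ///
      || SIC_2:

estat icc      // intraclass correlation: share of variance between industries
```

With about 120 observations spread over more than 30 industries, many groups contain only a few firms, so the industry-level variance may be estimated imprecisely.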



                        • #13
                          Dear Carlo,

                          I'm sorry, but I do not understand what to do now. At least, I think I don't. Most of all, I am still uncertain about whether I am actually using the right model.

                          [EDIT: I edited this post to better formulate the issues]

                          Perhaps this is the wrong question, but should I use a hierarchical regression model? Or perhaps a 2-way ANOVA? Or should I provide more information about the nature of my predictors, and if so, what could that be?

                          My second question is about what to do when normality of residuals is violated, as I then cannot use my results to test my hypotheses. Or does this differ for the hierarchical or mixed model?

                          My third question is, judging from the graphs and tests that I provided, do I violate the normality assumption? Could I proceed, or should I switch to another model?

                          PS: The help is very much appreciated, even though I keep asking more questions
                          Last edited by Warner de Jong; 30 May 2017, 08:27.



                          • #14
                            As for your example, perhaps this could help

                            Originally posted by Carlo Lazzaro:
                            Warner:
                            In your case you would have a fixed effect at the firm level and a random effect at the industry level.
                            CEOs and boards are nested in firms. Firms are nested in industries.

                            EDIT: I have just found a link that helps me in understanding the -mixed- option. The visuals help clarify. I'm going to leave the link here in case it will help future students. http://blog.stata.com/2013/02/04/mul...s-of-variance/
                            Last edited by Warner de Jong; 30 May 2017, 08:54.



                            • #15
                              Warner:
                              things are trickier than expected, then: see Example 4, -mixed- entry, Stata .pdf manual.
                              As things stand, you cannot replace a hierarchical model with OLS, as your estimates would be biased.
                              Two remarks aside from any technicalities:
                              - be sure that you have enough time to grasp at least the basics of the -mixed- model (Stata would be a good place to start);
                              - discuss/fine-tune with your teacher/supervisor/professor (who is paid for that) the goal of your research. Statistical analyses can be really tricky to perform and it is easy to end up with wrong results that, even in technical journals, are traded for gold when they are, at best, similar to copper.
                              Kind regards,
                              Carlo
                              (Stata 19.0)
