  • What to do if residuals are not normally distributed?

    We have been reading comments on what to do when residuals are not normally distributed, as is the case in our project. However, we have already established that our data are homoscedastic, so we are unsure whether using the robust option in our regression is appropriate. Is there anything else we can do to address the issue? We have also read that in large samples OLS estimators are asymptotically normally distributed and that the F and t statistics remain valid. Our sample has over 6000 observations; would that apply to us? All help is very much appreciated.

  • #2
    Tom:
    welcome to this forum.
    1) if your residuals are homoskedastic, default standard errors make sense;
    2) in addition, you may want to check your residuals for serial correlation. If there's evidence of that, just use -vce(cluster clusterid)-.
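A minimal sketch of point 1, written in plain Python rather than Stata so that it is self-contained. For a simple regression with skewed but constant-variance errors, the default (classical) standard error of the slope and the HC1 robust one should come out nearly the same. HC1 is, as far as I know, the form that -vce(robust)- reports after -regress-. The simulated data, the seed, and the function name `slope_standard_errors` are illustrative assumptions, not anything from the original project.

```python
import math
import random


def slope_standard_errors(x, y):
    """Return (slope, classical SE, HC1 robust SE) for y = a + b*x + e."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    dx = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in dx)
    slope = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    # Classical SE: assumes one common error variance for all observations.
    s2 = sum(e * e for e in resid) / (n - 2)
    se_classical = math.sqrt(s2 / sxx)
    # HC1 sandwich SE: squared residuals weighted by (x - x_bar)^2,
    # with the n / (n - k) small-sample correction (k = 2 here).
    meat = sum(d * d * e * e for d, e in zip(dx, resid))
    se_robust = math.sqrt(n / (n - 2) * meat) / sxx
    return slope, se_classical, se_robust


# Hypothetical data: skewed (shifted exponential) errors, constant variance.
rng = random.Random(12345)
n = 6000
x = [rng.uniform(0.0, 10.0) for _ in range(n)]
y = [1.0 + 0.5 * xi + (rng.expovariate(1.0) - 1.0) for xi in x]

slope, se_classical, se_robust = slope_standard_errors(x, y)
print(f"slope={slope:.4f}  classical SE={se_classical:.5f}  HC1 SE={se_robust:.5f}")
```

When the two standard errors agree this closely, choosing between them changes little; they diverge when the error variance depends on x.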
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      If we have shown that there is no heteroscedasticity, can we assume our residuals are normal?



      • #4
        Tom:
        normality is a (weak) requirement for the residual distribution.
        With a sample size of 6000, even a small departure from normality would make the test statistically significant.
        That said, I would not worry about this issue.
        Kind regards,
        Carlo
        (Stata 19.0)



        • #5
          Originally posted by Carlo Lazzaro:
          With a sample size of 6000, even a small departure from normality would make the test statistically significant.
          Hi Carlo Lazzaro. To be clear, you are talking about a statistical test of normality there, right? IMO, statistical tests of normality as precursors to models that assume normal errors are pretty much useless, because as you have said, they detect small inconsequential departures from normality when n is large, and they fail to detect important departures from normality when n is small.
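The large-n half of that point can be sketched with a small simulation, in plain Python so that it runs anywhere. It draws samples from a gamma distribution with shape 50 (population skewness about 0.28, so only mildly non-normal) and applies a hand-coded Jarque-Bera test at the 5% level. The choice of distribution, the seed, and the function names are assumptions made purely for illustration; no particular test was named in this thread.

```python
import random

CHI2_2DF_CRIT_05 = 5.991  # 5% critical value of chi-squared with 2 df


def jarque_bera(sample):
    """Jarque-Bera statistic: n/6 * (skewness^2 + excess_kurtosis^2 / 4)."""
    n = len(sample)
    mean = sum(sample) / n
    m2 = sum((v - mean) ** 2 for v in sample) / n
    m3 = sum((v - mean) ** 3 for v in sample) / n
    m4 = sum((v - mean) ** 4 for v in sample) / n
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0
    return n / 6.0 * (skew ** 2 + excess_kurt ** 2 / 4.0)


def rejection_rate(n, reps, shape, rng):
    """Share of simulated gamma(shape, 1) samples of size n that the test rejects."""
    rejections = 0
    for _ in range(reps):
        sample = [rng.gammavariate(shape, 1.0) for _ in range(n)]
        if jarque_bera(sample) > CHI2_2DF_CRIT_05:
            rejections += 1
    return rejections / reps


rng = random.Random(2024)
rate_small = rejection_rate(n=50, reps=200, shape=50.0, rng=rng)
rate_large = rejection_rate(n=6000, reps=20, shape=50.0, rng=rng)
print(f"rejection rate at n=50:   {rate_small:.2f}")
print(f"rejection rate at n=6000: {rate_large:.2f}")
```

The same mild departure is rarely flagged at n = 50 and should be flagged almost every time at n = 6000, which says more about the sample size than about whether the departure matters for the regression.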

          I may have missed it, but I did not think Tom Hetherington was asking about testing for normality. I think he was asking someone to confirm that n=6000 is large enough to assume that the sampling distributions of the parameter estimates will be near enough to normal so that he can go ahead and use his OLS model. I would guess that it likely is large enough. But I also note that reputable authors are usually reluctant to suggest an exact n that is large enough. E.g., here is an excerpt from Wooldridge's (2012) Introductory Econometrics.

          In addition to finite sample properties, it is important to know the asymptotic properties or large sample properties of estimators and test statistics. These properties are not defined for a particular sample size; rather, they are defined as the sample size grows without bound. Fortunately, under the assumptions we have made, OLS has satisfactory large sample properties. One practically important finding is that even without the normality assumption (Assumption MLR.6), t and F statistics have approximately t and F distributions, at least in large sample sizes. (p. 168, emphasis added)
          And for good measure, here is an excerpt from the book by Vittinghoff et al. (2012):

          In Sect. 4.1, we stated that in the multipredictor linear model, the error term ε is assumed to have a normal distribution. Confidence intervals for the regression coefficients and related hypothesis tests are based on the assumption that the coefficient estimates have a normal [sampling] distribution. If ε has a normal distribution, and other assumptions of the multipredictor linear model are met, then ordinary least squares estimates of the regression coefficients can be shown to have a normal distribution as required.

          However, it can be shown that the regression coefficients are approximately normal in larger samples even if ε does not have a normal distribution. (emphasis added)
          Notice that neither of these excerpts says how large n has to be to be large enough. But if we ask Jeff Wooldridge, maybe he'll offer an opinion about your n = 6000.
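One hedged way to get a feel for "large enough" is to simulate. The sketch below, in plain Python, draws strongly skewed errors (a shifted exponential), fits a simple OLS line at n = 6000, and counts how often the usual 95% interval for the slope covers the true value. The data-generating process, the seed, and the function names are assumptions for illustration only; real data with heavier tails or influential points could behave differently.

```python
import math
import random


def slope_and_se(x, y):
    """OLS slope and its classical standard error for y = a + b*x + e."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    syy = sum((yi - y_bar) ** 2 for yi in y)
    slope = sxy / sxx
    ssr = syy - slope * sxy  # residual sum of squares
    return slope, math.sqrt(ssr / (n - 2) / sxx)


def coverage(n, reps, true_slope, rng):
    """Share of replications whose 95% interval contains the true slope."""
    x = [rng.uniform(0.0, 10.0) for _ in range(n)]  # fixed design across reps
    hits = 0
    for _ in range(reps):
        # Errors are exponential(1) shifted to mean zero: skewness 2, not normal.
        y = [1.0 + true_slope * xi + (rng.expovariate(1.0) - 1.0) for xi in x]
        slope, se = slope_and_se(x, y)
        if abs(slope - true_slope) <= 1.96 * se:
            hits += 1
    return hits / reps


coverage_6000 = coverage(n=6000, reps=200, true_slope=0.5, rng=random.Random(7))
print(f"empirical coverage of the nominal 95% interval: {coverage_6000:.3f}")
```

Coverage close to 0.95 despite clearly non-normal errors is what the asymptotic results quoted above lead one to expect. Rerunning with a smaller n shows how the approximation behaves in shorter samples.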

          HTH.

          References

          Vittinghoff E, Glidden DV, Shiboski SC, McCulloch CE. Regression methods in biostatistics: linear, logistic, survival, and repeated measures models (2nd Ed.). Springer Science & Business Media; 2012.

          Wooldridge JM. Introductory Econometrics: A Modern Approach (5th Ed.). Boston: Cengage Learning; 2012.
          --
          Bruce Weaver
          Version: Stata/MP 18.5 (Windows)



          • #6
            Bruce:
            I like your detailed explanation, and I agree with you about the irrelevance of normality tests for the residual distribution.
            Kind regards,
            Carlo
            (Stata 19.0)



            • #7
              The fine comments here miss one issue of interest: how to check whether the model, even though apparently good, can be modified to make it better. Here I like to check a plot of residuals versus fitted values. I don't know why such diagnostic plots are not used more.
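In Stata the plot itself is available through -rvfplot- after -regress-. As a text-only stand-in, the following plain-Python sketch sorts observations by fitted value, splits them into five equal bins, and prints the mean residual in each bin. The data are hypothetical: the true relationship includes a quadratic term that the fitted straight line omits, so the bin means should trace the U-shape that a residual-versus-fitted plot would show. All names and numbers here are illustrative assumptions.

```python
import random


def fit_line(x, y):
    """Return (intercept, slope) of the OLS line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def binned_residual_means(x, y, bins=5):
    """Mean residual within equal-sized bins of observations sorted by fitted value."""
    a, b = fit_line(x, y)
    pairs = sorted((a + b * xi, yi - a - b * xi) for xi, yi in zip(x, y))
    size = len(pairs) // bins
    means = []
    for k in range(bins):
        # The last bin absorbs any leftover observations.
        chunk = pairs[k * size:(k + 1) * size] if k < bins - 1 else pairs[k * size:]
        means.append(sum(r for _, r in chunk) / len(chunk))
    return means


# Hypothetical data: curvature in x that a straight-line model leaves out.
rng = random.Random(99)
x = [rng.uniform(0.0, 10.0) for _ in range(6000)]
y = [1.0 + 2.0 * xi + 0.3 * (xi - 5.0) ** 2 + rng.gauss(0.0, 1.0) for xi in x]

means = binned_residual_means(x, y)
for k, m in enumerate(means, start=1):
    print(f"bin {k} (low to high fitted): mean residual = {m:+.2f}")
```

Bin means that swing from positive to negative and back suggest a missing nonlinear term, something no test of normality or heteroskedasticity is designed to reveal.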
