  • How to model a non-normal variable

    Hello all,

    New Stata user here. I am having some trouble modelling a variable that is key to answering my research question for my master's thesis. I am looking to create a multivariable linear regression model, where my dependent variable is a summed score of depression symptoms.

    The problem is that my dependent variable does not appear to be normally distributed (skewness = 0.801527; kurtosis = 3.196693). The distribution has a positive skew, and this is expected, as we assume that very high depression scores are not common in the general population.

    My question is about the best way to transform this variable so that I can use it in a linear regression model. I have tried the most common transformations I can think of (log, ln, e), and these actually make things worse. I would appreciate any help on this issue!

    Thanks so much,

    Jen

  • #2
    My advice is: don't. It is a myth that the dependent variable in a linear regression has to have a normal distribution. What is closer to true is that the residuals of the regression should be normally distributed. That is what is needed for the F-statistics and t-statistics to have exact F and t sampling distributions, so that the p-values are "exact." But it has long been known that these tests are quite robust to even substantial departures from normality. With skewness of 0.8 and kurtosis of about 3, I think you are likely to be just fine.

    And, better still, if your sample is reasonably large, the central limit theorem will give the coefficients an asymptotically normal distribution regardless of what the residuals look like, so that the t-statistics, etc. are all appropriate anyway.

    So, unless you have a really small sample, you really don't have to worry about it. I think that a sample small enough for non-normality to be a problem would, for other reasons, be considered inadequate for a master's thesis anyway.
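    The robustness claim can be checked with a small simulation. What follows is only a rough sketch in plain Python (not Stata, and not anything run on the poster's data): it draws errors from a shifted exponential distribution, which is more skewed than the 0.8 reported above, and counts how often a slope that is truly zero is declared significant. The sample size of 200, the 2000 replications, the seed, and the function names are all assumptions made for illustration.

```python
import math
import random


def ols_slope_t(x, y):
    """Return the t-statistic for the slope in a simple regression of y on x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual sum of squares, then the usual standard error of the slope.
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope / se


def rejection_rate(n=200, reps=2000, seed=12345):
    """Share of simulated samples in which a true zero slope is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(n)]
        # Skewed errors: exponential(1) shifted to mean zero (skewness 2).
        # y does not depend on x, so the null hypothesis is true.
        y = [rng.expovariate(1.0) - 1.0 for _ in range(n)]
        # 1.96 is the normal approximation to the 5% two-sided critical value.
        if abs(ols_slope_t(x, y)) > 1.96:
            rejections += 1
    return rejections / reps


rate = rejection_rate()
print(f"rejection rate under the null: {rate:.3f}")
```

    If the t-test were badly distorted by the skewed errors, the printed rate would sit far from the nominal 0.05. Rerunning with a much smaller n is one way to see where the approximation starts to strain.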

    • #3
      Just as a side note: if your data were more extreme than this, you could consider quantile regression. But that wouldn't work wonders with a small sample anyway.
      Best regards,

      Marcos
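      For readers unfamiliar with quantile regression, here is a minimal sketch of the idea behind it, in plain Python rather than Stata. It covers only the intercept-only case: the constant that minimises the total "check" (pinball) loss is a sample quantile, so the fit targets the median or another quantile instead of the mean. The scores are made up for illustration and the function names are hypothetical.

```python
def check_loss(residual, tau):
    """Check (pinball) loss for one residual at quantile level tau."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def best_constant(y, tau):
    """Intercept-only quantile fit: the data value minimising total check loss."""
    return min(sorted(y), key=lambda q: sum(check_loss(yi - q, tau) for yi in y))


# Made-up, positively skewed scores (hypothetical, not the poster's data).
scores = [0, 1, 1, 2, 2, 3, 4, 5, 7, 9, 14]

median_fit = best_constant(scores, 0.5)  # targets the median
upper_fit = best_constant(scores, 0.9)   # targets the 90th percentile
mean_fit = sum(scores) / len(scores)     # what least squares would target
print(median_fit, upper_fit, round(mean_fit, 2))
```

      With these skewed values the median fit lies below the mean, which is the sense in which quantile methods are less sensitive to a long right tail. A full quantile regression adds predictors to the same loss.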

      • #4
        In addition to the other advice, I would note that a summed score is presumably bounded. In those circumstances I would consider a logit link or other link to ensure that predictions remain in bounds. In your case, the mean is nearer the lower bound, but linearity would not in itself ensure within-bounds predictions. Whether that bites within the range of your dataset we can't tell. This is a bigger deal than normality of any marginal distribution, which is not an assumption here.
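        To make the bounds point concrete, here is a small sketch in plain Python (not Stata) comparing a straight-line prediction with one passed through an inverse logit and rescaled to the score range. Everything numeric is an assumption for illustration: the upper bound of 27, the lower bound of 0, and both sets of coefficients are made up and are not estimates from any data.

```python
import math

SCORE_MAX = 27.0  # hypothetical upper bound of the summed score; lower bound taken as 0


def linear_prediction(b0, b1, x):
    """Identity link: nothing stops the prediction leaving [0, SCORE_MAX]."""
    return b0 + b1 * x


def logit_link_prediction(b0, b1, x):
    """Logit link on the score rescaled to [0, 1], mapped back to the score scale."""
    return SCORE_MAX / (1.0 + math.exp(-(b0 + b1 * x)))


# Made-up coefficients, chosen only to show behaviour at extreme predictor values.
xs = [-4, -2, 0, 2, 4, 6]
linear = [linear_prediction(5.0, 4.5, x) for x in xs]
bounded = [logit_link_prediction(-1.5, 0.8, x) for x in xs]

for x, lin, bnd in zip(xs, linear, bounded):
    print(f"x={x:>2}  linear={lin:6.1f}  logit link={bnd:5.2f}")
```

        The linear column strays below 0 and above the assumed maximum at the extremes, while the logit-link column cannot. Whether real predictions would stray like this depends on the fitted coefficients and the range of the predictors.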
