  • Dealing with skewed data (with values smaller than 1)

    I want to perform a panel data analysis. I have a dependent variable with values between 0 and 0.61. The 2,360 observations cover ~130 countries over 2003-2020.
    The skewness of the data is 3.075139, meaning the data are not normally distributed. What is the best way, in your opinion, to deal with this violation of assumptions for a regression analysis?
    I removed observations above the 99th percentile (sketched in code below), and the skewness decreased to 1.3. The results of the regression stayed roughly the same.
    I am, however, not sure whether deleting observations is justified. Should I prefer some other way to normalize the data (log(x+1) transformation, root transformation, etc.)?
    If yes, what method would you recommend?
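    For concreteness, a minimal Stata sketch of the trimming step described above, assuming the dependent variable is called y (a hypothetical name):
      * hypothetical variable name: y is the dependent variable
      summarize y, detail
      display r(skewness)               // sample skewness of y
      display r(p99)                    // 99th percentile of y
      * drop observations above the 99th percentile (the step described above; see the replies below on whether this is advisable)
      drop if y > r(p99) & !missing(y)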
    Last edited by Dan Eran; 02 Nov 2021, 13:03.

  • #2
    It's not an assumption of regression that the outcome variable is normally distributed. I am genuinely curious about where you got any idea to the contrary. It's not mentioned as an assumption in any competent text or course, because it isn't one.

    What to do depends sensitively on what the variable is. For example, it may be something with an upper limit of 1.

    There isn't advice here that's independent of what the response is, except that removing high values makes no sense unless you know that they are all wrong. It's like throwing a basketball team off a plane because they won't fit easily in the seats. The problem there is the plane, not the basketball players.



    • #3
      Thank you for your response.
      As for your curiosity - it was just my rusty memory; for some reason, I thought that was an assumption. I also know that trade economists log-transform the data because it is highly skewed and delete zero-trade observations (although PPML helps to overcome the issue).
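      As an aside, PPML here is, roughly speaking, Poisson pseudo-maximum likelihood, i.e. Poisson regression with robust standard errors, which keeps zero-trade observations in the sample. A rough Stata sketch with hypothetical gravity-style variable names:
        * hypothetical variables: trade (can be zero), lngdp_o, lngdp_d, lndist
        poisson trade lngdp_o lngdp_d lndist, vce(robust)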
      So, are you telling me I can proceed with the analysis even though my data has a skewness of ~3? It is important to note that the variable could have a maximum value of one, but in practice, never exceeds 0.61.
      Many thanks,
      Dan



      • #4
        I'd go for a logit link in a generalized linear model with binomial family and robust standard errors (a minimal code sketch is given at the end of this post). The arguments go back to https://academic.oup.com/biomet/arti...1/3/439/249095 at least. The key advantages are:

        1. You don't need to transform the zeros, or fudge them otherwise. The model says that the mean outcome is positive, which is compatible with some values being zero.

        2. The predicted values will be within range. With your data, there is serious risk of predicting negative outcomes and minor risk of predicting outcomes greater than 1.

        3. The variance properties will make more sense. As your mean goes to 0, so does the variance, because the only way to get a mean of zero is for all the values to be zero. Similarly, the variance is likely to be higher when the mean is around 0.5. So you won't have homoscedasticity of errors, which is an assumption often made in classic linear regression.

        That said, if you have a frequency spike at zero because there is a sizable group who never do whatever it is, you might need something designed for that.
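        A minimal Stata sketch of this suggestion, with hypothetical names (y is the fractional outcome, x1 and x2 stand in for the predictors, country identifies the panel):
          * GLM with binomial family, logit link, and standard errors clustered on country
          glm y x1 x2 i.year, family(binomial) link(logit) vce(cluster country)
          * fracreg fits essentially the same fractional logit model and is another option
          fracreg logit y x1 x2 i.year, vce(cluster country)
        Clustering on country is one common choice for this kind of panel; the year dummies and predictor names are placeholders only.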



        • #5
          Thank you very much.



          • #6
            Originally posted by Nick Cox View Post
            It's not an assumption of regression that the outcome variable is normally distributed. I am genuinely curious about where you got any idea to the contrary. It's not mentioned as an assumption in any competent text or course, because it isn't one.
            There are detailed explanations: https://data.library.virginia.edu/normality-assumption/
            When I first learned data analysis, I always checked normality for each variable and made sure they were normally distributed before running any analyses, such as t-test, ANOVA, or linear regression. I thought normal distribution of variables was the important assumption to proceed to analyses. That’s why stats textbooks show you how to draw histograms and QQ-plots in the beginning of data analysis in the early chapters and see if they’re normally distributed, isn’t it? There I was, drawing histograms, looking at the shape and thinking, “Oh, no, my data are not normal. I should transform them first or I can’t run any analyses.”

            No, you don’t have to transform your observed variables just because they don’t follow a normal distribution. Linear regression analysis, which includes t-test and ANOVA, does not assume normality for either predictors (IV) or an outcome (DV).

            No way! When I learned regression analysis, I remember my stats professor said we should check normality!

            Yes, you should check normality of errors AFTER modeling.
            And actually, even checking the normality of errors seems not (very) necessary (a minimal check is sketched after the quotation below): https://stats.stackexchange.com/ques...e-purpose-of-e
            The regression assumption that is generally least important is that the errors are normally distributed. In fact, for the purpose of estimating the regression line (as compared to predicting individual data points), the assumption of normality is barely important at all. Thus, in contrast to many regression textbooks, we do not recommend diagnostics of the normality of regression residuals.
            Gelman, A., & Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press
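            To illustrate "check normality of errors after modeling" in Stata (variable names hypothetical):
              regress y x1 x2
              predict ehat, residuals       // residuals from the fitted model
              qnorm ehat                    // quantile-normal plot of the residuals; as quoted above, this diagnostic is rarely crucial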



            • #7
              The quotations in #6 are interesting, but nothing is simple here. For example, it is a really good idea to look at the distributions of variables early on so that modelling choices are well informed. For one, nothing shows up data quality problems so well as some good graphs. For another, very skewed distributions or long-tailed distributions may well have implications in terms of what model(s) you try and whether (e.g.) transformation of predictors may help.

              Some second or third courses in various fields seem to leave behind whatever was taught in a statistics first course in an apparent belief that macho master's statistics is all about modelling. Right, and wrong. The results are often what people deserve, a model that ignores or mishandles important features in the data.
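              Picking up the point above about informative graphs, a couple of quick Stata sketches (variable and panel names hypothetical):
                histogram y                       // shape of the outcome: skewness, spikes, impossible values
                graph box y, over(year)           // distribution by year; useful for spotting data-quality problems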



              • #8
                Yes, exploratory data analysis comes first and is basic; from the first citation link in #6 again: https://data.library.virginia.edu/normality-assumption/
                Okay, I understand my variables don’t have to be normal. Why do we even bother checking histogram before analysis then?

                Although your data don’t have to be normal, it’s still a good idea to check data distributions just to understand your data. Do they look reasonable? Your data might not be normal for a reason. Is it count data or reaction time? In such cases, you may want to transform it or use other analysis methods (e.g., generalized linear models or nonparametric methods). The relationship between two variables may also be non-linear (which you might detect with a scatterplot). In that case transforming one or both variables may be necessary.
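                For the non-linearity point, a quick Stata check (hypothetical names):
                  scatter y x1 || lowess y x1     // scatterplot with a lowess smooth to reveal non-linear relationships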
