  • Would it be problematic to log my dependent variable? It takes values only between 0 and 1

    Hi all, I am estimating a fixed effects model and the dependent variable is the Human Development Index (HDI), which always lies between 0 and 1. I want a log-level model in Stata, but I'm now wondering: won't the log transformation distort the model? Would it be appropriate to add 1 to every HDI value so that they all exceed 1? Or should I just use a linear model instead?

  • #2
    Tatenda:
    welcome to this forum.
    1) clearly, -ln(0)- produces missing values; hence, the trivial question is: how many observations report 0?
    2) adding a small constant in the attempt to fix the issue is not recommended;
    3) you do not say whether the distribution of your regressand is positively skewed (in which case, subject to point 1, logging reduces the skewness) or negatively skewed (in which case logging worsens the situation).
    Personally, with a DV with many zeros, I would stick with a linear-linear model.
    Kind regards,
    Carlo
    (StataNow 18.5)
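
    Points 1) and 3) can be checked directly. A minimal sketch follows, assuming the index is stored in a variable called hdi (a hypothetical name, as is ln_hdi):

```stata
* Sketch only: hdi and ln_hdi are hypothetical variable names.
count if hdi == 0                  // point 1: observations that ln() would turn into missing
summarize hdi, detail              // the detailed output includes the skewness of the raw index

generate double ln_hdi = ln(hdi)   // any zeros become missing here
summarize ln_hdi, detail           // compare the skewness after logging
```

    If the skewness moves further from zero after logging, that is the case described in point 3) where the transformation makes matters worse.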


    • #3
      As far as I recall from the last time I used such data, no country has an HDI of zero.

      But if you're considering a logarithmic transformation you do need reasons why it is a good idea.

      log(HDI + 1) is a poor idea if only because it doesn't make much difference, as a direct graph will show you:

      Code:
      twoway function log(x + 1), range(0 1)
      (We can add that choosing 1 as the constant is highly arbitrary here, as Carlo Lazzaro perhaps hints in #2.)
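
      To make the comparison explicit, one could overlay the straight line joining the endpoints of that curve. This is only a sketch; the dashed reference line and the legend text are additions for illustration:

```stata
* Sketch only: log(x + 1) on [0, 1] against the straight line through its endpoints.
twoway function y = ln(x + 1), range(0 1) ||                  ///
       function y = x * ln(2), range(0 1) lpattern(dash)      ///
       legend(order(1 "log(x + 1)" 2 "straight line"))        ///
       xtitle("HDI") ytitle("")
```

      On this interval the two curves differ by at most about 0.06, so log(HDI + 1) is close to a linear rescaling of HDI.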

      Without knowing how much difference it makes, my inclination is to take HDI as it arrives.


      • #4
        Others who are unfamiliar with the Human Development Index (as I was) may find this page helpful. Don't miss the link to the technical notes.
        --
        Bruce Weaver
        Version: Stata/MP 18.5 (Windows)


        • #5
          Originally posted by Carlo Lazzaro:
          Tatenda:
          welcome to this forum.
          1) clearly, -ln(0)- produces missing values; hence, the trivial question is: how many observations report 0?
          2) adding a small constant in the attempt to fix the issue is not recommended;
          3) you do not say whether the distribution of your regressand is positively skewed (in which case, subject to point 1, logging reduces the skewness) or negatively skewed (in which case logging worsens the situation).
          Personally, with a DV with many zeros, I would stick with a linear-linear model.

          The issue is not that some observations are 0, but that they all lie between 0 and 1, so their logs are negative. I'm not quite sure what you mean by point 3?


          • #6
            Tatenda:
            I meant that, if the distribution of your regressand is negatively skewed, its shape (in terms of departure from the Gaussian) is worsened by the -ln()- transformation.
            The opposite is expected if the distribution of your regressand is positively skewed (as is frequent with total cost distributions, which usually have a long right tail reflecting their positive skewness).
            Kind regards,
            Carlo
            (StataNow 18.5)


            • #7
              It's never a problem if logarithms are negative.
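
              One way to see this, as a sketch: take a log-level fixed effects model $\ln \mathrm{HDI}_{it} = \alpha_i + \beta x_{it} + \varepsilon_{it}$ (the symbols are illustrative and not from the original question) and rescale HDI by any positive constant $k$:

```latex
\ln\left(k \cdot \mathrm{HDI}_{it}\right) = \ln k + \ln \mathrm{HDI}_{it}
\quad\Longrightarrow\quad
\ln\left(k \cdot \mathrm{HDI}_{it}\right) = \left(\alpha_i + \ln k\right) + \beta x_{it} + \varepsilon_{it}
```

              With $k = 100$, say, every logarithm is positive whenever HDI exceeds 0.01, yet only the intercepts shift and $\beta$ is unchanged. The sign of the logged values therefore carries no consequence for the slope estimates.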

              In any case, predicted values could (arguably, should) be reported on the original scale, as should axis labels on graphs etc. even if you use logarithmic scales.

              But here is a graph based on 2020 HDI. The curve shows log HDI against HDI and is over the observed range from Norway downwards. The y axis is labelled in terms of HDI using mylabels from SSC.

              [Figure: HDI.png]


              There is, I guess, not nearly enough curvature here to make a big difference, but if there is a good argument for using logarithms otherwise, then go with it.
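
              A graph of this kind can be sketched without the data. The range 0.4 to 0.95 below is an assumed, purely illustrative span and not the observed 2020 range; the axis labels are placed by hand here, which is the step that mylabels automates:

```stata
* Sketch only: the range 0.4 to 0.95 is an illustrative assumption,
* not the observed 2020 range behind the graph in this post.
twoway function y = ln(x), range(0.4 0.95)                          ///
       ylabel(`=ln(0.4)' "0.4" `=ln(0.6)' "0.6" `=ln(0.8)' "0.8",   ///
              angle(horizontal))                                    ///
       xtitle("HDI") ytitle("HDI (logarithmic scale)")
```

              The curve is log HDI plotted against HDI, but the y axis shows values on the original HDI scale, so the degree of curvature can be judged directly.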


              • #8
                You might consider sidestepping the problem of linear v. log by going to an ordered probit model. There is a test for choosing the appropriate model (or an intermediate form) but the name and details escape me at this moment.


                • #9
                  So each distinct value of HDI defines a separate level? I've seen that kind of model being advocated elsewhere.
