
  • Dealing with missing values on log transformed variables

    I am trying to fix the unbalanced structure of my panel dataset, which is caused by missing values in certain log-transformed variables. According to my prof, the usual trick for dealing with ln of zero is to add 1 to the variable, i.e. to use ln(x + 1). Should I apply it to the whole dataset or just to the affected variables?

  • #2
    You should ask your prof what (s)he advises.

    This can be called a trick, but it is also a different transformation, sometimes with quite different behaviour. I used to dislike it quite a lot but now think that, in context, it can sometimes be an idea worth trying if applied with extreme caution to predictor variables (some still say: independent variables).

    The usual arm-waving defence is that for large x, ln x and ln(x + 1) are almost identical.

    Sure, but it is also true that for small positive x

    * ln (x + 1) is always positive while its derivative is never steeper than 1

    * ln x can be arbitrarily large negative and its derivative arbitrarily steep.

    These differences can bite, but how much depends on your data.
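
    The contrast between the two transformations can be checked numerically. The following is a minimal sketch in Python (not Stata), using only the standard library; the function name `compare_logs` and the grid of x values are made up for illustration.

```python
import math

def compare_logs(x):
    """Return ln(x), ln(x + 1) and their derivatives 1/x and 1/(x + 1) for x > 0."""
    return {
        "ln_x": math.log(x),
        "ln_x_plus_1": math.log1p(x),          # log1p(x) computes ln(1 + x)
        "slope_ln_x": 1.0 / x,                 # derivative of ln(x)
        "slope_ln_x_plus_1": 1.0 / (x + 1.0),  # derivative of ln(x + 1)
    }

# Illustrative values only: small positive x versus large x.
for x in (0.001, 0.1, 1.0, 100.0, 10000.0):
    r = compare_logs(x)
    print(
        f"x={x:>9}: ln x={r['ln_x']:9.4f}  ln(x+1)={r['ln_x_plus_1']:9.4f}  "
        f"slopes: {r['slope_ln_x']:10.4f} vs {r['slope_ln_x_plus_1']:7.4f}"
    )
```

    For small positive x the two columns diverge sharply (large negative values and steep slopes for ln x, small positive values and slopes below 1 for ln(x + 1)), while for large x they nearly coincide.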

    For outcome variables I think you're usually better off using a model with logarithmic link (in generalized linear model jargon).

    But why are you transforming at all? The best reason for transforming would be if

    y = a + b1 T(x1) + whatever


    for a transformation T() is a better specification than

    y = a + b1 x1 + whatever

    i.e. goals of linearity and additivity are more nearly met.



    • #3
      As a side note, we had a related discussion not too long ago, see https://www.statalist.org/forums/for...g-using-0-0001
      Best wishes

      (Stata 16.1 MP)



      • #4
        It seems like you could substitute zero for ln(zero) and add a dummy variable that is 1 for those cases and 0 otherwise. That wouldn't distort the ln transformation, and it would allow those observations to participate in the regression, provided the pattern of zeroes is sufficiently diverse to allow estimation of the dummies.
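
        A minimal sketch of that device, in Python (not Stata) and with the standard library only. The function name `log_with_zero_indicator` is hypothetical, and the sketch assumes the variable is non-negative; it builds the two columns that would both enter the regression.

```python
import math

def log_with_zero_indicator(values):
    """Return (ln_x, is_zero) pairs.

    For x > 0 the pair is (ln(x), 0); for x == 0 it is (0.0, 1), so the
    observation is kept and the indicator absorbs the zero cases.
    """
    rows = []
    for x in values:
        if x < 0:
            raise ValueError("this device assumes non-negative values")
        if x == 0:
            rows.append((0.0, 1))
        else:
            rows.append((math.log(x), 0))
    return rows

# Illustrative values only.
for x, (ln_x, is_zero) in zip((0, 0.5, 1, 20), log_with_zero_indicator((0, 0.5, 1, 20))):
    print(f"x={x:>4}  ln_x={ln_x:7.3f}  is_zero={is_zero}")
```

        Note that x = 0 and x = 1 both get ln_x = 0; only the indicator tells them apart, which is why both columns need to be included as predictors.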

        I have also seen the suggestion to use a cube root transformation instead of ln. That also tames large values.



        • #5
          #4 is a device I've often seen recommended. I just note that if there are values in (0, 1) then their logarithms are defined and negative and (that being so) it seems a bit awkward to treat values of 0 as if they were greater than those values. I suppose the argument is that the indicator (you say dummy) variable for being zero takes care of that.

          If the variable in question is counted with possible values zero or positive integers, the device is more elegant.

          The thread linked to in #3 mentions neglog and asinh as alternatives, and cube root is often mentioned in these threads. In this context we should add that square roots are also defined for zero arguments.
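
          As a quick check that these alternatives are all defined at zero, here is a small Python sketch (not Stata; standard library only). It assumes neglog means sign(x) ln(1 + |x|), which is the usual definition; the helper names `neglog` and `cube_root` and the sample values are illustrative.

```python
import math

def neglog(x):
    """sign(x) * ln(1 + |x|): defined for all real x, equal to 0 at x = 0."""
    return math.copysign(math.log1p(abs(x)), x)

def cube_root(x):
    """Real cube root, preserving sign; defined for all real x."""
    return math.copysign(abs(x) ** (1.0 / 3.0), x)

TRANSFORMS = {
    "neglog": neglog,
    "asinh": math.asinh,   # inverse hyperbolic sine, ln(x + sqrt(x^2 + 1))
    "cube_root": cube_root,
    "sqrt": math.sqrt,     # defined for x >= 0 only
}

# Illustrative non-negative values, including zero.
for x in (0.0, 0.5, 10.0, 1000.0):
    print(x, {name: round(f(x), 3) for name, f in TRANSFORMS.items()})
```

          All four return 0 at x = 0 and pull in large values, though by different amounts: the roots compress less strongly than neglog and asinh, which behave like logarithms for large x.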
          Last edited by Nick Cox; 05 Jan 2022, 08:32.

