Dealing with missing values on log transformed variables

Patricia Nicole

Join Date: Jan 2022

Posts: 3
#1


05 Jan 2022, 04:26

I am trying to correct the unbalanced data of my panel dataset, which is caused by missing values in certain log-transformed variables. According to my prof, the usual trick when dealing with ln of zero is to add 1 to the variable, i.e. to use ln(x + 1). Should I apply it to the whole dataset or just to the affected variables?
Tags: data, panel data, regression
Nick Cox

Join Date: Mar 2014

Posts: 35696
#2

05 Jan 2022, 05:02

You should ask your prof what (s)he advises.

This can be called a trick, but it's also a different transformation with sometimes quite different behaviour. I used to dislike it quite a lot but now think that in context it can sometimes be an idea worth trying, if applied with extreme caution to predictor variables (some still say: independent variables).

The usual arm-waving defence runs that for large x, ln x and ln(x + 1) are almost identical.

Sure, but it is also true that for small positive x

* ln (x + 1) is always positive while its derivative is never steeper than 1

* ln x can be arbitrarily large negative and its derivative arbitrarily steep.

These differences can bite, but how much depends on your data.
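As a rough numerical illustration of the two bullet points above, here is a minimal sketch in Python (my choice for the example, not something from the thread; the function name compare_logs and the sample x values are made up). It prints ln x and ln(x + 1) with their derivatives 1/x and 1/(x + 1) at a few positive x.

```python
import math

def compare_logs(x):
    """Return (ln x, ln(x + 1), slope of ln x, slope of ln(x + 1)) for x > 0."""
    return math.log(x), math.log1p(x), 1 / x, 1 / (x + 1)

# Illustrative values only: small x shows the divergence, large x the agreement.
for x in (0.001, 0.1, 1, 10, 1000):
    ln_x, ln_x1, slope_x, slope_x1 = compare_logs(x)
    print(f"x={x:<8} ln x={ln_x:9.4f}  ln(x+1)={ln_x1:8.4f}  "
          f"slopes: {slope_x:10.4f} vs {slope_x1:6.4f}")
```

The two columns agree closely at the top of the range and part company near zero, which is where the choice of transformation matters.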

For outcome variables I think you're usually better off using a model with logarithmic link (in generalized linear model jargon).
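One way to see why a logarithmic link copes with zeros in the outcome: the link is applied to the mean of y, not to each y. The sketch below (Python, with a made-up three-value outcome) only contrasts ln of the mean with the mean of ln; it does not fit a model.

```python
import math

y = [0, 2, 4]  # hypothetical outcome containing a genuine zero

# Log link: work with ln(E[y]); only the mean needs to be positive.
log_of_mean = math.log(sum(y) / len(y))

# Log transform: work with E[ln y]; every single y needs to be positive.
try:
    mean_of_logs = sum(math.log(v) for v in y) / len(y)
except ValueError:
    mean_of_logs = None  # ln(0) is undefined, so the transform fails here

print(log_of_mean, mean_of_logs)
```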

But why are you transforming at all? The best reason for transforming would be if

y = a + b1 T(x1) + whatever

for a transformation T() is a better specification than

y = a + b1 x1 + whatever

i.e. goals of linearity and additivity are more nearly met.
2 likes
Felix Bittmann

Join Date: Aug 2018

Posts: 693
#3

05 Jan 2022, 05:13

As a side note, we had a related discussion not too long ago, see https://www.statalist.org/forums/for...g-using-0-0001

Best wishes

Stata 18.0 MP | ORCID | Google Scholar
1 like
Daniel Feenberg

Join Date: Oct 2014

Posts: 323
#4

05 Jan 2022, 06:06

It seems like you could substitute zero for ln(zero) and add a dummy variable that is 1 for those cases and 0 otherwise. That wouldn't distort the ln transformation, and would allow those observations to participate in the regression, provided the pattern of zeroes is sufficiently diverse to allow estimation of the dummies.
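A minimal sketch of how the two variables might be constructed, written in Python purely for illustration (the function name log_with_zero_indicator is hypothetical, and negative inputs are rejected as an assumption of the sketch):

```python
import math

def log_with_zero_indicator(values):
    """Return (ln_x, is_zero): ln(x) with 0 substituted where x == 0,
    plus an indicator that is 1 exactly for those substituted cases."""
    ln_x, is_zero = [], []
    for v in values:
        if v < 0:
            raise ValueError("log is undefined for negative values")
        if v == 0:
            ln_x.append(0.0)   # placeholder value, flagged by the indicator
            is_zero.append(1)
        else:
            ln_x.append(math.log(v))
            is_zero.append(0)
    return ln_x, is_zero

ln_x, is_zero = log_with_zero_indicator([0, 0.5, 1, 20])
print(ln_x, is_zero)
```

Both variables would then enter the regression together, so that the coefficient on the indicator absorbs the arbitrary choice of 0 as the placeholder.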

I have also seen the suggestion to use a cube root transformation instead of ln. That also tames large values.
1 like
Nick Cox

Join Date: Mar 2014

Posts: 35696
#5

05 Jan 2022, 07:27

#4 is a device I've often seen recommended. I just note that if there are values in (0, 1) then their logarithms are defined and negative and (that being so) it seems a bit awkward to treat values of 0 as if they were greater than those values. I suppose the argument is that the indicator (you say dummy) variable for being zero takes care of that.

If the variable in question is counted with possible values zero or positive integers, the device is more elegant.

The thread linked to in #3 mentions neglog and asinh as alternatives, and cube root is often mentioned in these threads. In this context we should add that square roots are also defined for zero arguments.
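For concreteness, here is a short Python sketch of these alternatives, all of which are defined at zero. It assumes the usual definition of neglog as sign(x) ln(1 + |x|); the helper names neglog and cube_root are mine, and the sample values are arbitrary.

```python
import math

def neglog(x):
    """sign(x) * ln(1 + |x|): defined for all real x, with neglog(0) = 0."""
    return math.copysign(math.log1p(abs(x)), x)

def cube_root(x):
    """Real cube root, defined for negative, zero and positive x."""
    return math.copysign(abs(x) ** (1 / 3), x)

# Arbitrary sample values, including zero.
for x in (0, 0.5, 1, 10, 1000):
    print(f"x={x:<6} neglog={neglog(x):7.4f}  asinh={math.asinh(x):7.4f}  "
          f"cube root={cube_root(x):7.4f}  sqrt={math.sqrt(x):8.4f}")
```

For large x, asinh(x) is close to ln(2x), so it behaves like a logarithm in the upper range while passing smoothly through zero.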

Last edited by Nick Cox; 05 Jan 2022, 07:32.