  • Log transformation of 0 is not working (using +0.0001)

    Hi everybody

    I am looking to log-transform a variable that measures salary. However, some observations are 0, which makes the log transformation fail. I have read that you can add 0.0001 to all observations and then transform, but it is not working: the values of 0.0001 become missing. My goal is to calculate the percentage change over time for a treatment and control group.

    Here is some code and an example dataset:

    Code:
    sort id time
    replace salary = 0 if id == 1 & time == 99 // just to create the issue that I am facing in my main dataset
    gen  ln1_salary = salary+0.00001
    bro id tim salary ln1_salary
    replace ln1_salary = ln(salary)
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(id time treatment salary gender)
     8 103 0 40000 0
     8  99 0 40000 0
    17  97 1 40000 0
     9  98 0 40000 0
    18 103 1 35000 0
    18  98 1 40000 0
    14  99 0 40000 0
    17  99 1 40000 0
    12  97 1 40000 0
     5 100 1 40000 0
    12  98 1 40000 0
    14 101 0 40000 0
    14  98 0 40000 0
    18 100 1 40000 0
    17  98 1 40000 0
     9 101 0 40000 0
    17 100 1 40000 0
    12  99 1 40000 0
     5 102 1 35000 0
    12 102 1 35000 0
    17 102 1 35000 0
    12 101 1 35000 0
    13 100 0 40000 0
    13 102 0 40000 0
     8  98 0 40000 0
     5 103 1 35000 0
    18 102 1 35000 0
     5  97 1 40000 0
    12 100 1 40000 0
    17 103 1 35000 0
     9 102 0 40000 0
    13 103 0 40000 0
     8 102 0 40000 0
     5  98 1 40000 0
     9  99 0 40000 0
    14 100 0 40000 0
    13  99 0 40000 0
     8 100 0 40000 0
    14  97 0 40000 0
     9 103 0 40000 0
    17 101 1 35000 0
    13  98 0 40000 0
    13  97 0 40000 0
    14 103 0 40000 0
     5 101 1 35000 0
     8  97 0 40000 0
    12 103 1 35000 0
    18 101 1 35000 0
     8 101 0 40000 0
    14 102 0 40000 0
     5  99 1 40000 0
     9 100 0 40000 0
    18  99 1 40000 0
    13 101 0 40000 0
     9  97 0 40000 0
    18  97 1 40000 0
     2  98 1 30000 1
     1 102 1 25000 1
    20  98 1 30000 1
     2  97 1 30000 1
    15  97 0 30000 1
    19  98 1 30000 1
    16 101 0 30000 1
     1 100 1 30000 1
    20 103 1 25000 1
    19 101 1 25000 1
     2 101 1 25000 1
    19 103 1 25000 1
     2  99 1 30000 1
     1 103 1 25000 1
    15  99 0 30000 1
    20 102 1 25000 1
     6  97 0 30000 1
     7 103 0 30000 1
     1 101 1 25000 1
    15 100 0 30000 1
    16  97 0 30000 1
    20 100 1 30000 1
     1  97 1 30000 1
     6 101 0 30000 1
    20  97 1 30000 1
    19  97 1 30000 1
    16  99 0 30000 1
    16 103 0 30000 1
     2 100 1 30000 1
    16 100 0 30000 1
    15 101 0 30000 1
    15 102 0 30000 1
     1  99 1 30000 1
     7 100 0 30000 1
    15 103 0 30000 1
     7  98 0 30000 1
    16  98 0 30000 1
     2 102 1 25000 1
     7  97 0 30000 1
     6 100 0 30000 1
    15  98 0 30000 1
     7  99 0 30000 1
     1  98 1 30000 1
    16 102 0 30000 1
    end

  • #2
    A typo?
    Code:
    replace ln1_salary = ln(ln1_salary)



    • #3
      Gustav:
      In addition to the likely typo spotted by Chen, please note that such a small constant will give back a negative salary on the ln scale (which is absurd).
      As usual, the issue is that -ln(0)- does not exist, and any trick to make this calculation feasible biases the results (as the constant you choose obviously plays a role in it).
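
      For instance, the implied log-change from a salary of 0 to a salary of 40000 depends entirely on that arbitrary constant (a quick arithmetic check in Stata; the constants 0.0001 and 1 are just examples):

      Code:
      display ln(40000 + 0.0001) - ln(0 + 0.0001)   // roughly 19.8
      display ln(40000 + 1) - ln(0 + 1)             // roughly 10.6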
      Kind regards,
      Carlo
      (StataNow 18.5)



      • #4
        The logarithm of 0 is undefined, by definition. Economics is not my area, so it may make sense to add some small value as an offset. A potential problem is that calculating the percentage change for individuals who were unemployed (no income) at baseline will give a meaningless value for this parameter, because the true percentage change is also undefined, and it will be very sensitive to the choice of offset. I might consider checking my results against a subset that excludes those unemployed at baseline, if that is sensible in your research context.
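
        A minimal sketch of that sensitivity check, assuming the earliest observed period counts as the baseline (adjust to your design):

        Code:
        * flag individuals whose salary is zero in their first observed period
        bysort id (time): gen byte zero_at_baseline = salary[1] == 0
        * rerun the main analysis with -if !zero_at_baseline- as a sensitivity check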

        Edit: crossed with #3
        Last edited by Leonardo Guizzetti; 09 Nov 2021, 09:37.



        • #5
          This is also a great example of why you should not do that. Adding such a small number seems like no big deal: an income of 0.00001 is pretty close to 0. However, just as the log transformation "pulls in" extremely high incomes, it also "stretches out" the extremely low numbers, so now you have created huge outliers:
          [Figure: Graph.png]
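
          A quick way to see the size of that stretch (simple arithmetic in Stata):

          Code:
          display ln(0.00001)   // about -11.5: where the fudged zeros end up
          display ln(25000)     // about 10.1
          display ln(40000)     // about 10.6: where the actual salaries sit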

          crossed with #2 and #3, which I agree with (not surprising, since we are making the same point). However, I would not say that a negative salary on the log scale is absurd: it just means that the salary is somewhere between 0 and 1.
          Last edited by Maarten Buis; 09 Nov 2021, 09:44.
          ---------------------------------
          Maarten L. Buis
          University of Konstanz
          Department of history and sociology
          box 40
          78457 Konstanz
          Germany
          http://www.maartenbuis.nl
          ---------------------------------



          • #6
             Gustav, adding a number to a variable, no matter how small the number is, changes the variable's distribution, and this practice is being abandoned by more and more researchers. You may use -poisson- regression instead; afterwards you can still calculate the semi-elasticity, just as with a log transformation.

            Code:
            poisson salary x..., vce(robust)
            margins, eydx(x)
            Add: crossed with #5
            Last edited by Fei Wang; 09 Nov 2021, 09:46.



            • #7
               Choose a more appropriate model; Poisson, perhaps, or some other GLM with a log link.


              Edit: crossed with #6



              • #8
                Hi everybody

                 Thanks for the quick and useful answers. I just realized an error in my example data (I did not specify the count in -dataex-).

                In relation to the many suggestions to use a different model, I was actually looking to use the following one:

                Code:
                sort id time
                replace salary = 0 if id == 1 & time == 99
                qui reg salary i.time##i.treatment##i.gender
                margins treatment, at(time=(99 101) gender =(1))        
                margins, eydx(treatment) at(time=(99 101) gender =(1)) post
                margins, coeflegend
                lincom _b[1.treatment:2._at]-_b[1.treatment:1bn._at]
                di (exp(r(estimate))-1)*100
                 But when I use many control variables with a rare outcome (say, government benefits), the model breaks down (and that was why I considered the 0.0001 solution). Any ideas why the eydx approach is not running?

                Here is the updated data:
                Code:
                * Example generated by -dataex-. For more info, type help dataex
                clear
                input float(id time treatment salary gender)
                 8 103 0 40000 0
                 8  99 0 40000 0
                17  97 1 40000 0
                 9  98 0 40000 0
                18 103 1 35000 0
                18  98 1 40000 0
                14  99 0 40000 0
                17  99 1 40000 0
                12  97 1 40000 0
                 5 100 1 40000 0
                12  98 1 40000 0
                14 101 0 40000 0
                14  98 0 40000 0
                18 100 1 40000 0
                17  98 1 40000 0
                 9 101 0 40000 0
                17 100 1 40000 0
                12  99 1 40000 0
                 5 102 1 35000 0
                12 102 1 35000 0
                17 102 1 35000 0
                12 101 1 35000 0
                13 100 0 40000 0
                13 102 0 40000 0
                 8  98 0 40000 0
                 5 103 1 35000 0
                18 102 1 35000 0
                 5  97 1 40000 0
                12 100 1 40000 0
                17 103 1 35000 0
                 9 102 0 40000 0
                13 103 0 40000 0
                 8 102 0 40000 0
                 5  98 1 40000 0
                 9  99 0 40000 0
                14 100 0 40000 0
                13  99 0 40000 0
                 8 100 0 40000 0
                14  97 0 40000 0
                 9 103 0 40000 0
                17 101 1 35000 0
                13  98 0 40000 0
                13  97 0 40000 0
                14 103 0 40000 0
                 5 101 1 35000 0
                 8  97 0 40000 0
                12 103 1 35000 0
                18 101 1 35000 0
                 8 101 0 40000 0
                14 102 0 40000 0
                 5  99 1 40000 0
                 9 100 0 40000 0
                18  99 1 40000 0
                13 101 0 40000 0
                 9  97 0 40000 0
                18  97 1 40000 0
                 2  98 1 30000 1
                 1 102 1 25000 1
                20  98 1 30000 1
                 2  97 1 30000 1
                15  97 0 30000 1
                19  98 1 30000 1
                16 101 0 30000 1
                 1 100 1 30000 1
                20 103 1 25000 1
                19 101 1 25000 1
                 2 101 1 25000 1
                19 103 1 25000 1
                 2  99 1 30000 1
                 1 103 1 25000 1
                15  99 0 30000 1
                20 102 1 25000 1
                 6  97 0 30000 1
                 7 103 0 30000 1
                 1 101 1 25000 1
                15 100 0 30000 1
                16  97 0 30000 1
                20 100 1 30000 1
                 1  97 1 30000 1
                 6 101 0 30000 1
                20  97 1 30000 1
                19  97 1 30000 1
                16  99 0 30000 1
                16 103 0 30000 1
                 2 100 1 30000 1
                16 100 0 30000 1
                15 101 0 30000 1
                15 102 0 30000 1
                 1  99 1 30000 1
                 7 100 0 30000 1
                15 103 0 30000 1
                 7  98 0 30000 1
                16  98 0 30000 1
                 2 102 1 25000 1
                 7  97 0 30000 1
                 6 100 0 30000 1
                15  98 0 30000 1
                 7  99 0 30000 1
                 1  98 1 30000 1
                16 102 0 30000 1
                 6  99 0 30000 1
                 7 101 0 30000 1
                19 102 1 25000 1
                 6  98 0 30000 1
                19  99 1 30000 1
                20 101 1 25000 1
                 7 102 0 30000 1
                 6 103 0 30000 1
                 6 102 0 30000 1
                20  99 1 30000 1
                19 100 1 30000 1
                 2 103 1 25000 1
                end
                Last edited by Gustav Egede Hansen; 09 Nov 2021, 10:22.



                • #9
                   When using the eydx approach on the aforementioned rare outcome with many controls, I get the error: "inconsistent estimation sample levels 0 and 1 of factor treatment"



                  • #10
                     Gustav, I assume that by "rare outcome" you mean an outcome variable with many zeros and only a few positive values. Estimation can fail with many controls because the variation in the DV is not sufficiently large. Given the data structure and sample size, one solution is to reduce the control variables and include only key factors. Other than that, I don't think there is a clear answer. But there are methods for outcomes with many zeros, such as -zip- for Poisson with many zeros and -relogit- for logit with many zeros -- you may give them a try, but I'm not sure they'll solve your problem, particularly when even linear estimation has broken down.

                     By the way, an advantage of using -poisson- is that the reported coefficient is already the semi-elasticity (if you don't have interactions), so you won't need -margins, eydx()-, which could be the cause of the error message.
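
                     A minimal sketch of that point, with hypothetical continuous covariates x1 and x2:

                     Code:
                     * Poisson with a log link: each coefficient is d ln E(salary)/dx, i.e. a semi-elasticity
                     poisson salary x1 x2, vce(robust)
                     * with no interactions, the average semi-elasticity from -margins- equals the coefficient
                     margins, eydx(x1 x2)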
                    Last edited by Fei Wang; 09 Nov 2021, 10:55.



                    • #11
                      Late to this party. I agree with most points already made, but not all. More than 30 years ago, John Tukey called transformations the worst untaught part of data analysis, or words to similar effect, and something like that remains true.

                       Transformations of outcomes have, for some purposes, been superseded by what in generalized linear model jargon are called link functions, whereby (in particular) estimation on a log scale can accommodate some zeros, or even negative values, in the data, because the leading idea is that the mean function is positive, not that all the values are. Poisson regression is one version of this idea.
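
                       A small self-contained illustration of that point (simulated data, just to show the mechanics):

                       Code:
                       clear
                       set obs 200
                       set seed 12345
                       gen x = rnormal()
                       gen y = rpoisson(exp(0.5 + x))    // outcome with some zeros
                       count if y == 0
                       poisson y x, vce(robust)          // runs fine: only E(y|x) > 0 is needed, not y > 0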

                      There are many reasons for transformations otherwise even now, which range from pure convenience (don't knock it; it beats inconvenience) through easier visualizations to keeping an eye on the assumptions (almost always a better wording would be: ideal conditions) for whatever model you are fitting.

                      One thing we should be able to agree on easily is that any intuition that log(x + smidgen) is a good idea to cope with zeros in x is correct only for x >> 1 (for which the fudge isn't needed) and utterly wrong for x << 1 (for which the fudge supposedly is the fix). So, to pile yet further on a point already very well made, we can use Mata to show that it can produce outliers at the low end of the data.

                      Code:
                      : log10((0, 1, 10, 100, 1000))
                             1   2   3   4   5
                          +---------------------+
                        1 |  .   0   1   2   3  |
                          +---------------------+
                      
                      : log10((0, 1, 10, 100, 1000) :+ 0.0001)
                                       1             2             3             4             5
                          +-----------------------------------------------------------------------+
                        1 |           -4   .0000434273   1.000004343   2.000000434   3.000000043  |
                          +-----------------------------------------------------------------------+
                      The real shame here is with whoever makes this suggestion in print. Please tell us their textbooks so that we know to avoid them.

                      Using base 10 for logarithms is pure convenience there, and naturally (so to speak) the same idea applies with any other base, e, 2, 42 or 666. Adding smidgen = 0.0001 is fine with 1 10 100 1000 and so on, which is the intuition here, but not at all fine at the other end.

                       I used to think that log(x + 1) was not much better, but I've found occasions where it is helpful (e.g. for visualization). It can be helpful that log(x + 1) is close to x near 0 and that its derivative at 0 is 1. Stata, like some other software, supports this as log1p(), which is intriguing. Further, there is a small but notable literature on asinh() and on sign(x) log(1 + |x|), both of which can work nicely with outcomes that can be negative, zero, or positive. These functions have loosely similar behaviour, which follows ineluctably from their similar definitions.
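
                       A quick check of that similar behaviour (assuming log1p() and asinh() are available in your Stata version):

                       Code:
                       display log1p(0.5)    // .405
                       display asinh(0.5)    // .481 -- both stay close to x near 0
                       display log1p(1000)   // 6.909
                       display asinh(1000)   // 7.601 -- asinh(x) is roughly ln(2x) for large x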

                      I am less convinced by grumbles that a transformation changes a distribution -- that's often precisely the point, or much of it -- or that elasticity calculations become awkward. Elasticity calculations become awkward if there are zeros in the data, period.

                      Naturally the practical concern that zeros may mean missing or invalid or something not zero but unknown is utterly orthogonal and sometimes the major issue.

                       I want to connect this discussion with pharmacology, where evidently a standard design gives doses of some drug or other substance of 0 at one extreme and otherwise doses that are equally spaced on a logarithmic scale. Workers in this field must have ways of dealing simultaneously with zeros and a log scale, so perhaps some readers can point to authoritative and/or comprehensive discussions.



                      • #12
                        Hi all

                         Thank you all for the replies. A Poisson regression seems to work on the example data; however, the model will not run when I apply analytical weights (stemming from a Coarsened Exact Matching procedure), which is the case with my real data. Does anyone know of alternative approaches?

                        Best,
                        Gustav



                        • #13
                          glm supports aweights.
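
                           For instance, a minimal sketch combining the earlier model with the matching weights (assuming the weight variable produced by the matching step is called cem_weights; substitute your own variable name):

                           Code:
                           * Poisson-family GLM with a log link and analytic weights from the matching
                           glm salary i.time##i.treatment##i.gender [aweight=cem_weights], ///
                               family(poisson) link(log) vce(cluster id)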



                          • #14
                             Thank you all again for your replies! In my actual data, I observe 9,700 panels monthly over a span of 12 years. I want to measure salary, working hours per month, and benefits. Salary and working hours are continuous variables (working hours range from 0 to approx. 300 hours per month); they are right-skewed, have some 0's, and contain no negative values. Benefits is also continuous and right-skewed, but it has more 0's and even negative values (some individuals are paying back benefits in some months).
                             Following the advice above, I have looked into -glm-. Given the nature and distributions of salary and working hours, it seems appropriate to specify GLMs with a gaussian, poisson, or gamma family, all with a log link:
                            Code:
                            glm salary i.time##i.treatment##i.gender covariates, cluster(id) family(gaussian) link(log)
                            *or
                            glm salary i.time##i.treatment##i.gender covariates, cluster(id) family(poisson) link(log)
                            *or
                            glm salary i.time##i.treatment##i.gender covariates, cluster(id) family(gamma) link(log)
                             After running the above, I would then run the following to get the percentage change from before the treatment (time = 99) to after it (time = 101):
                            Code:
                            margins treatment, at(time=(99 101) gender =(1))       
                            margins, eydx(treatment) at(time=(99 101) gender =(1)) post
                            margins, coeflegend
                            lincom _b[1.treatment:2._at]-_b[1.treatment:1bn._at]
                            di (exp(r(estimate))-1)*100
                             I understand that you cannot say for sure without knowing my data, but does the above seem appropriate? And if so, would it be reasonable to report the percentage change as a range based on the glm results from the three families (they seem to produce more or less similar results)? Or is there a diagnostic tool capable of deciding between the three? Given the continuous nature of salary and working hours and the sample size, I am myself more inclined to go with the gaussian family.

                             With the benefits variable, the poisson option is not working (from my understanding, because counts can only be positive), but the gamma and gaussian options do work, although now with more divergent results. Unfortunately, I cannot find clear guidelines on what to do with an outcome that takes negative values, so does anyone have a suggestion?

                            Best,
                            Gustav



                            • #15
                               I further realized that I should use -xtset- and -xtgee-/-xtpoisson-. It also struck me that I need to incorporate fixed effects in my models. (When I previously approached my data as normally distributed continuous variables, I used pooled OLS, which, to my knowledge, is equivalent to fixed effects when conducting a diff-in-diff with constant control variables, so there was no need to -xtset- my data.) Following this discussion (https://www.statalist.org/forums/for...-fixed-effects), I found that I could do -xtpoisson yvar i.time##i.treatment, fe vce(robust)- if my outcomes are not normally distributed (see the sketch below), but, to be more flexible in the future, I would also like to know whether it is possible to incorporate fixed effects with -xtgee-.
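
                               A minimal sketch of that setup, assuming the data are declared as a monthly panel:

                               Code:
                               xtset id time
                               xtpoisson salary i.time##i.treatment, fe vce(robust)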

                               And again, I am still puzzled as to whether salary and working hours need to be approached with a method other than OLS regression (looking at the literature in my field, I have encountered many who use OLS to study, e.g., salary).
                              Last edited by Gustav Egede Hansen; 11 Nov 2021, 07:59.

