log transformation of a ratio variable with small values

Sang-Bum Park

Join Date: Jun 2015

Posts: 51
#1


16 Feb 2019, 10:42

I log-transformed a ratio variable (sem), which ranged from 5.73e-10 to .0001021 (mean: 1.72e-06, s.d.: 2.15e-06).

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         sem |       9176    1.72e-06    2.15e-06   5.73e-10   .0001021

To prevent negative numbers after the log transformation, I added 1 to the original variable, as below.
gen lsem=log(sem+1)

But the log transformation appears to have failed: all values remained unchanged, as shown below.

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         sem |       9176    1.72e-06    2.15e-06   5.73e-10   .0001021
        lsem |       9176    1.72e-06    2.15e-06   5.73e-10   .0001021

What was wrong? How can I log-transform this variable?
Clyde Schechter

Join Date: Apr 2014

Posts: 29898
#2

16 Feb 2019, 11:20

If you expand log(1+x) in a Taylor series you get x - x²/2 + x³/3 - x⁴/4 + ... Because the numbers you are using are all smaller (most of them a great deal smaller) than 10^-3, the quadratic and higher-order terms are all smaller than 10^-6. So, to at least 6 decimal places, log(1+x) is the same as x. In fact, this approximation is often exploited to speed up calculations or simplify equations in many contexts.
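The approximation can be checked numerically. The following is a minimal sketch, not part of the original post: it builds a hypothetical two-observation dataset holding the minimum and maximum of sem reported in #1 and compares log(1+x) with x in double precision. The variable names x, log1px and absdiff are illustrative assumptions.

```stata
* Hypothetical check: how far is log(1+x) from x for values as small as sem?
clear
set obs 2
generate double x = 5.73e-10 in 1          // minimum of sem reported in #1
replace x = .0001021 in 2                  // maximum of sem reported in #1
generate double log1px = log(1 + x)        // the transformation used in #1
generate double absdiff = abs(log1px - x)  // roughly x^2/2 by the Taylor series
format x log1px absdiff %12.0g
list, noobs
```

The differences are far smaller than the precision that summarize displays, which is why the two rows of output in #1 look identical.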

May I ask why you want to eliminate negative values for log(x)? How will that be a problem for you?
Sang-Bum Park

Join Date: Jun 2015

Posts: 51
#3

16 Feb 2019, 16:37

Schechter, thank you very much for your kind answer. Two reasons made me do this. One was that the initial log-transformation resulted in all negative values. The other was that I log-transformed this variable in order to build interaction terms with other variables, and positive values seemed better for interpreting interaction effects. Specifically, before adding 1 (i.e., log(sem+1)), I log-transformed as below.

gen lsem=log(sem)
sum lsem
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        lsem |       9176   -13.87985     1.417644  -21.27972  -9.189582

The problem was that all values were negative although the original values are all positive. Is it better to use this? It seemed reasonable and the result showed a normal distribution (attached). But, I am hesitating to use it because of the above reasons.

Attached Files

Last edited by Sang-Bum Park; 16 Feb 2019, 16:54.
Clyde Schechter

Join Date: Apr 2014

Posts: 29898
#4

16 Feb 2019, 18:32

Well, yes, the logarithms will all be negative because the original values are all less than 1. But that is not a problem at all.
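On the interaction concern raised in #3, a minimal sketch on simulated data (not the poster's data; all variable names and coefficients are assumptions made for illustration) suggests why the sign of the logged values is harmless: shifting a predictor by a constant, here by centering it, leaves the interaction coefficient unchanged and only changes the coefficient on the other variable.

```stata
* Simulated illustration: an all-negative logged predictor in an interaction
clear
set seed 321
set obs 1000
generate double lsem = rnormal(-13.9, 1.4)   // hypothetical, mimics log(sem)
generate double z = rnormal()                // hypothetical second predictor
generate double y = 1 + 0.5*lsem + 0.3*z + 0.2*lsem*z + rnormal()

* Interaction with the raw (negative) logged values
regress y c.lsem##c.z
scalar b_int_raw = _b[c.lsem#c.z]

* Same model after shifting lsem by a constant (centering at its mean)
summarize lsem, meanonly
generate double lsem_c = lsem - r(mean)
regress y c.lsem_c##c.z
scalar b_int_ctr = _b[c.lsem_c#c.z]

display "interaction (raw): " b_int_raw "   interaction (centered): " b_int_ctr
```

Centering can make the coefficient on z easier to read, since it then refers to the effect of z at the mean of lsem rather than at lsem = 0.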

I wouldn't describe the distribution of log(x) that you show as normal. It's probably closer to normal looking than that of x itself. But that then raises another question: why do you want this variable to have a normal distribution?

There is a widespread erroneous belief that variables need to have normal distributions to be used in linear regressions. It is still widely taught today although it is completely wrong. On the outcome variable side, the most that one can say of this nature is that the residuals, not the outcome variable itself, need to be normally distributed in order for the p-values and t-tests to work properly in a small sample. But if your sample has 9,176 observations, the central limit theorem will rescue the t-tests and p-values from all but the most extreme violations of the normality of residuals. And on the predictor variable side there are no distributional requirements of any kind at all.

So really, the only reason you should be thinking about log-transforming this variable is if there is reason to believe that log x is linearly related to your outcome variable but x itself is not (or, if x is the outcome variable, that log x is linearly related to the predictor(s) but x itself is not).
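As a rough illustration of that criterion, here is a sketch on simulated data in which the outcome is constructed to be linear in log(x). Nothing here comes from the poster's dataset; x, lx, y and the coefficients are assumptions chosen only to mimic a predictor spanning several orders of magnitude.

```stata
* Simulated illustration: compare linear fits on x and on log(x)
clear
set seed 12345
set obs 1000
generate double x = exp(rnormal(-14, 1.4))      // hypothetical skewed predictor, all values below 1
generate double lx = log(x)
generate double y = 2 + 0.5*lx + rnormal(0, 1)  // y is linear in log(x) by construction

regress y x
scalar r2_raw = e(r2)
regress y lx
scalar r2_log = e(r2)
display "R-squared with x: " r2_raw "   with log(x): " r2_log
```

With real data the functional form is unknown, so a comparison like this is only a starting point; plots of the outcome against x and against log(x) are usually more informative than R-squared alone.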
Sang-Bum Park

Join Date: Jun 2015

Posts: 51
#5

17 Feb 2019, 08:35

I am very grateful for your excellent and very helpful answer. I now understand that values smaller than 1 naturally become negative after log-transformation, and that it is the residuals, not the dependent variable itself, that need to be normally distributed, a requirement that matters little in a large sample. Okay, I will use the original values without a log-transformation, although I don't know whether the results would differ after log-transforming. Thank you!
Nick Cox

Join Date: Mar 2014

Posts: 35327
#6

17 Feb 2019, 09:01

You haven't addressed whether the relationship (conditional on the other predictors) is more nearly linear after transformation. With a range of values over more than 5 orders of magnitude, from 5.73e-10 to .0001021 on the original scale, I would not be surprised at outliers that exert considerable leverage. Added variable plots should help you decide.
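A minimal sketch of how added variable plots might be drawn after regress, using simulated data. The variables y, z, sem and lsem below are hypothetical stand-ins, since the thread does not say what the outcome or the other predictors are.

```stata
* Simulated illustration: added variable plots on the raw and log scales
clear
set seed 2019
set obs 500
generate double sem = exp(rnormal(-14, 1.4))   // hypothetical ratio predictor
generate double lsem = log(sem)
generate double z = rnormal()                  // hypothetical other predictor
generate double y = 1 + 0.5*lsem + 0.3*z + rnormal()

* Raw scale: look for a few points with high leverage on the right
regress y sem z
avplot sem, name(av_raw, replace)

* Log scale: look for a more even, more nearly linear scatter
regress y lsem z
avplot lsem, name(av_log, replace)
```

avplots (with a final s) draws the plots for all predictors in one combined graph.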

Last edited by Nick Cox; 17 Feb 2019, 09:33.
Alfonso Sánchez-Peñalver

Join Date: Mar 2014

Posts: 432
#7

17 Feb 2019, 09:23

Excellent comments by Clyde Schechter and Nick Cox, as usual. I just want to add that nothing has been said about what model is being estimated, or whether he is just using OLS. One thing he may want to consider is that sem looks like a fractional response variable, since all of its values are between 0 and 1. Although somewhat appropriate if you want to do predictions at the means, OLS is not appropriate if you're trying to do predictions at other values, particularly far away from the mean, because there is nothing stopping it from predicting a value less than 0 or greater than 1. An alternative is to use glm with a binomial family and a logit link, as indicated in Papke and Wooldridge (1993), and Papke and Wooldridge (1996). I cite both, because the first one extends the second in covering cases where we need to work with the total cases in each proportion.
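A sketch of the glm approach on simulated data, assuming for the sake of illustration that the fractional variable is the outcome. The names frac, x1 and phat and the data-generating values are hypothetical; robust standard errors are requested here, which is the usual practice with this quasi-likelihood estimator.

```stata
* Simulated illustration: fractional outcome with a binomial family and logit link
clear
set seed 42
set obs 1000
generate double x1 = rnormal()
* hypothetical fractional outcome, strictly between 0 and 1
generate double frac = invlogit(-1 + 0.5*x1 + rnormal(0, 0.5))

glm frac x1, family(binomial) link(logit) vce(robust)

* Predicted means stay inside the unit interval, unlike OLS predictions
predict double phat, mu
summarize phat
```

Stata notes that the dependent variable has noninteger values and then fits the model. Stata 14 and later also offer fracreg logit for the same kind of model.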

References:
Papke, Leslie E. and Jeffrey M. Wooldridge. 1993. Econometric Methods for Fractional Response Variables with an Application to 401(K) Plan Participation Rates. NBER Technical Working Paper 147.

Papke, Leslie E. and Jeffrey M. Wooldridge. 1996. Econometric Methods for Fractional Response Variables with an Application to 401(K) Plan Participation Rates. Journal of Applied Econometrics 11(6): 619-632.

Alfonso Sanchez-Penalver
Nick Cox

Join Date: Mar 2014

Posts: 35327
#8

17 Feb 2019, 09:36

Alfonso Sánchez-Peñalver I read #3 as implying that this is a predictor. If not, it could hardly appear in interaction terms.
Alfonso Sánchez-Peñalver

Join Date: Mar 2014

Posts: 432
#9

17 Feb 2019, 09:43

Nick Cox oh you are right, but if that's the case why do we care if it is normally distributed? The only reason to log-transform it would be if the relationship with the explained variable was monotonically nonlinear. Multiplying the variable by 100 to have it in percent form could be an option there, but for values that are very small they will still remain negative when log-transformed.

The discussion about the distribution of the variable is what made me think that it was the explained variable. Sorry about that.

Alfonso Sanchez-Penalver
Nick Cox

Join Date: Mar 2014

Posts: 35327
#10

17 Feb 2019, 09:53

It doesn't have to be normally distributed. But marked skewness on the original scale -- which is established by #3 -- is often in practice associated with difficulties in establishing a linear relationship. Like you, I would underline that curvature is often reduced by taking logarithms.

If the OP wants products of this and other variables, the problems could be compounded.

Normality isn't a goal, but saying that is no denial that approximate symmetry of distributions often makes analyses easier, if only for the simpler behaviour of relationships they often imply.

I assume that in #1 the OP meant a ratio calculated from some numerator/some denominator. Log scale is often a natural choice for ratios given log ratio = log numerator - log denominator.
Alfonso Sánchez-Peñalver

Join Date: Mar 2014

Posts: 432
#11

17 Feb 2019, 10:16

Hi Nick, I agree with what you're saying, but I was just saying that it is not necessary particularly if the relationship really seems to be linear.

If you have the values of the numerator and the denominator... why not enter each separately, log-transformed? That way you're not restricting the coefficient on the log denominator to be exactly the negative of the coefficient on the log numerator, which allows you to test whether the log of the proportion is appropriate or whether you may have a misspecification. That restriction can be tested with a simple t-test. It also allows the interactions with each of the two variables to have different effects. Of course, if you don't have the two variables, this won't be possible.
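A sketch of that test on simulated data. The variables lnum, lden and y are hypothetical; the outcome is constructed so that only the log ratio matters, which means the restriction holds by design in this example.

```stata
* Simulated illustration: test whether the two log coefficients sum to zero
clear
set seed 99
set obs 1000
generate double lnum = rnormal(8, 1.5)            // hypothetical log numerator
generate double lden = 0.8*lnum + rnormal(9, 1)   // hypothetical log denominator, correlated with lnum
generate double y = 1 + 0.5*(lnum - lden) + rnormal()

regress y lnum lden

* H0: coefficient on lnum = -(coefficient on lden), i.e. only log(num/den) matters
test lnum + lden = 0
```

A small p-value would suggest that the numerator and denominator do not enter only through their ratio. lincom lnum + lden reports the same comparison as an estimate with a confidence interval.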

Alfonso Sanchez-Penalver
Sang-Bum Park

Join Date: Jun 2015

Posts: 51
#12

17 Feb 2019, 22:07

I am very grateful for your (Clyde Schechter, Nick Cox, and Alfonso Sánchez-Peñalver) excellent answers. For clarity, I detail the numerator and denominator of my variable (sem). Although the data differ from those shown above (obs. 9176), the variables are the same. This is a part of the (real) dataset that I am working on. Summary statistics for sem are as follows:

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         sem |       1316    .0010682     .0010715   5.73e-07   .0081791

sem is a ratio variable: the denominator (den) is very large while the numerator (num) is (relatively) small.

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         num |       1357    7588.634     12467.37         11     101970
         den |       1333    1.28e+07     2.11e+07   261737.8   2.29e+08

If sem is log-transformed as a ratio (log(sem)), then the log variable (lsem) is as follows:

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        lsem |       1316   -7.546429       1.6295  -14.37197  -4.806176

As Clyde Schechter explained, all the log-transformed values are negative because the ratio is smaller than 1. He suggested that I use it. I call this option 1.

If I instead divide log-transformed num by log-transformed den (log(num)/log(den)), then the result (lsem2) is as follows:

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       lsem2 |       1316    .5186528     .0909031   .1746379   .6448187

This is what I understood from Nick Cox's comment. I call this option 2.

The correlation between lsem and lsem2 is very high (0.94) and significant (p<0.001).

In addition, I want to raise another issue, namely the high correlation between num and den (0.76, p<0.001).

My questions are twofold.
First, can I use option 2 instead of option 1? Or should I follow option 1?
Second, is there any problem in using a ratio with a high correlation between a numerator and a denominator?

Thank you all very much.
Nick Cox

Join Date: Mar 2014

Posts: 35327
#13

18 Feb 2019, 00:30

I did not suggest using log(num)/log(den), which is malformed. I just pointed out that log(num/den) = log num - log den. I'd advise revision of your textbook or course notes on logarithms. So, Option 2 is a misreading and not a good idea.
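The distinction can be confirmed numerically. The sketch below uses simulated values that only roughly span the ranges of num and den reported in #12; they are hypothetical, not the poster's data, and the names lratio, ldiff and lquot are illustrative.

```stata
* Simulated illustration: log of a ratio vs. ratio of logs
clear
set seed 7
set obs 200
generate double num = 11 + runiform()*100000    // hypothetical numerators
generate double den = 261738 + runiform()*2e8   // hypothetical denominators

generate double lratio = log(num/den)           // log of the ratio
generate double ldiff  = log(num) - log(den)    // identical up to rounding
generate double lquot  = log(num)/log(den)      // a different quantity altogether

summarize lratio ldiff lquot
assert reldif(lratio, ldiff) < 1e-12
```

The first two variables agree to rounding error and are negative whenever num < den, while the quotient of logs is a different variable lying between 0 and 1 here.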
Nick Cox

Join Date: Mar 2014

Posts: 35327
#14

18 Feb 2019, 01:58

I see no problem in correlation of numerator and denominator. For example, consider a bunch of countries and numerator number of Stata users and denominator number of intelligent population. You'd expect a correlation; much of the point of a ratio is to adjust for it.

I mostly agree with Alfonso but I think you need a substantive reason to enter numerator and denominator separately as predictors.

Notice that we're in the dark here even on what this ratio is. The substance matters, as a ratio could be anything from a standard measure that people in your field are comfortable thinking about as a well-defined predictor to something more esoteric or ad hoc.
Sang-Bum Park

Join Date: Jun 2015

Posts: 51
#15

18 Feb 2019, 08:43

Nick Cox Thank you for your answer. I misunderstood your point, and I now see that the difference between the log numerator and the log denominator does equal the log of the ratio (num/den). Specifically, I revised lsem2 of #12 as follows.

gen lsem2=log(num)-log(den)

This variable is the same as lsem (log(num/den)). I also appreciate your kind explanation of my second question.