Would it be problematic to log my dependent variable? It takes values only between 0 and 1

Tatenda Smith

Join Date: Apr 2022

Posts: 7
#1

01 May 2022, 06:58

Hi all, I am estimating a fixed effects model and the dependent variable in question is the Human Development Index (HDI), which is always between 0 and 1. I want to fit a log-level model in Stata, but I'm now wondering: won't the log transformation muck up the model? Would it be appropriate to add 1 to all my HDI values to keep them above 1? Or should I just go for a linear model instead?
Tags: fixed effects, log transformed, panel, panel data, regression
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#2

01 May 2022, 08:12

Tatenda:
welcome to this forum.
1) clearly, -ln(0)- produces missing values; hence, the trivial question is: how many observations report 0?
2) adding a small constant in the attempt to fix the issue is not recommended;
3) you do not say whether the distribution of your regressand is positively skewed (in which case, provided that 1) is not an issue, logging reduces the skewness) or negatively skewed (in which case logging worsens the situation).
Personally, with a DV with many zeros, I would stick with a linear-linear model.
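
Points 1) and 3) can be checked directly before choosing a specification. The following is only a sketch: the variable name hdi and the simulated values are hypothetical stand-ins for the real data, which are not shown in this thread.

```stata
* Simulated stand-in for the real data (hypothetical variable name: hdi)
clear
set seed 12345
set obs 500
generate hdi = rbeta(5, 2)      // illustrative values strictly inside (0, 1)

* Point 1: how many observations are exactly 0 (ln() would be missing there)?
count if hdi == 0

* Point 3: the skewness is reported in the detailed summary
summarize hdi, detail
display "skewness = " r(skewness)
```

With real data, only the count and summarize lines are needed.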

Kind regards,
Carlo
(Stata 19.0)
Nick Cox

Join Date: Mar 2014

Posts: 35697
#3

01 May 2022, 09:48

I don't recollect, last time I used such data, that any country has HDI zero.

But if you're considering a logarithmic transformation you do need reasons why it is a good idea.

log(HDI + 1) is a poor idea if only because it doesn't make much difference, as a direct graph will show you:

Code:

twoway function log(x + 1), range(0 1)

(We can add that choosing 1 as a constant is here highly arbitrary, as Carlo Lazzaro perhaps hints in #2.)
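
To judge the curvature by eye, the same curve can be overlaid with the straight line through its endpoints (0, 0) and (1, ln 2). This is a sketch; the dashed reference line and the legend text are illustrative additions, not part of the original post. The two curves differ by at most about 0.06 over this range.

```stata
* log(x + 1) on [0, 1] against the chord joining its endpoints
* (run from a do-file: /// continues the command across lines)
twoway (function y = log(x + 1), range(0 1))                     ///
       (function y = ln(2) * x, range(0 1) lpattern(dash)),      ///
       legend(order(1 "log(x + 1)" 2 "straight line through endpoints")) ///
       ytitle("") xtitle("x")
```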

Without knowing how much difference it makes, my inclination is to take HDI as it arrives.
Bruce Weaver

Join Date: May 2014

Posts: 1132
#4

01 May 2022, 09:56

Others who are unfamiliar with the Human Development Index (as I was) may find this page helpful. Don't miss the link to the technical notes.

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 19.5 (Windows)
Tatenda Smith

Join Date: Apr 2022

Posts: 7
#5

03 May 2022, 11:35

Originally posted by Carlo Lazzaro

Tatenda:
welcome to this forum.
1) clearly, -ln(0)- produces missing values; hence, the trivial question is: how many observations report 0?
2) adding a small constant in the attempt to fix the issue is not recommended;
3) you do not say whether the distribution of your regressand is positively skewed (in which case, provided that 1) is not an issue, logging reduces the skewness) or negatively skewed (in which case logging worsens the situation).
Personally, with a DV with many zeros, I would stick with a linear-linear model.

It's not that some observations are 0, but that they are all between 0 and 1, and the log of such values is negative. I'm not quite sure what you mean by bullet point 3?
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#6

03 May 2022, 11:42

Tatenda:
I meant that, if the distribution of your regressand is negatively skewed, its shape (in terms of departure from the Gaussian) is worsened by the -ln()- transformation.
The opposite is expected if the distribution of your regressand is positively skewed (as is frequent with total cost distributions, which usually have a long right tail, reflecting their positive skewness).
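
A small simulation illustrates the point. It is only a sketch: both variables are made up for the purpose (a lognormal for positive skewness, a beta(5, 1) for negative skewness) and are not HDI data.

```stata
clear
set seed 12345
set obs 1000

generate y_pos = exp(rnormal())   // lognormal: long right tail
generate y_neg = rbeta(5, 1)      // mass near 1: long left tail

generate ln_pos = ln(y_pos)
generate ln_neg = ln(y_neg)

* Compare skewness before and after taking logs
tabstat y_pos ln_pos y_neg ln_neg, statistics(skewness)
```

Logging should pull the skewness of y_pos towards zero and push that of y_neg further below zero.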

Kind regards,
Carlo
(Stata 19.0)
Nick Cox

Join Date: Mar 2014

Posts: 35697
#7

03 May 2022, 11:51

It's never a problem if logarithms are negative.
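
As a sketch of that point, a fixed-effects regression runs without complaint when every value of the logged outcome is negative. Everything here is simulated and hypothetical (the names country, year, x, hdi, ln_hdi and the data-generating process alike); it is not the original poster's model.

```stata
* Simulated panel: 50 countries, 10 years, outcome strictly in (0, 1)
clear
set seed 2022
set obs 50
generate country = _n
generate u = rnormal(0, 0.5)            // country-specific effect
expand 10
bysort country: generate year = 2009 + _n
generate x = rnormal()
generate hdi = invlogit(0.5 + 0.3 * x + u + rnormal(0, 0.2))
generate ln_hdi = ln(hdi)               // negative for every observation

summarize hdi ln_hdi
xtset country year
xtreg hdi x, fe                         // level-level
xtreg ln_hdi x, fe                      // log-level
```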

In any case, predicted values could (arguably, should) be reported on the original scale, as should axis labels on graphs etc. even if you use logarithmic scales.

But here is a graph based on 2020 HDI. The curve shows log HDI against HDI and is over the observed range from Norway downwards. The y axis is labelled in terms of HDI using mylabels from SSC.

There is, I guess, not nearly enough curvature here to make a big difference, but if there is a good argument for using logarithms otherwise, then go with it.
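
A graph of this kind might be drawn along the following lines. This is a sketch under assumptions: the endpoints 0.4 and 0.96 are illustrative rather than taken from the post, and mylabels must first be installed from SSC.

```stata
* mylabels is community-contributed: ssc install mylabels
* Put axis labels at log(0.4), ..., log(0.9) but show them as HDI values
mylabels 0.4(0.1)0.9, myscale(log(@)) local(yla)

* (run from a do-file: /// continues the command across lines)
twoway function y = log(x), range(0.4 0.96)      ///
    ylabel(`yla', angle(horizontal))             ///
    ytitle("HDI (log scale)") xtitle("HDI")
```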
Daniel Feenberg

Join Date: Oct 2014

Posts: 323
#8

04 May 2022, 06:55

You might consider sidestepping the problem of linear v. log by going to an ordered probit model. There is a test for choosing the appropriate model (or an intermediate form) but the name and details escape me at this moment.
Nick Cox

Join Date: Mar 2014

Posts: 35697
#9

04 May 2022, 07:03

Daniel Feenberg: So each distinct value of HDI defines a separate level? I've seen that kind of model being advocated elsewhere.