
  • Log transforming variables

    Two questions related to log transforming variables.

    I understand that we would want to log transform the dependent variable if it is not normally distributed. However, what if you take the log and still don't have a normal distribution?

    I am also not clear on how to determine whether you should take the log of independent variables.

  • #2
    There is, or should be, detailed discussion of this in your regression (economists say "econometrics") text, and if not you need a better text.

    Let's assume a context of regression. The essential core of regression is, I suggest, taking Y = Xb as a summary function of the mean of Y as a function of X. It's not assumed that any of those variables has a normal distribution. At most it's convenient if -- conditional on X -- Y has a normal distribution, as that makes some things easier.

    Taking logarithms is

    * hardly possible if any value is zero or negative, although various tricks and devices have been suggested for this case

    * sometimes helpful otherwise if a variable is right-skewed -- or more importantly if that gets us nearer to linearity of relationship and/or independent and homoscedastic errors

    * otherwise usually a bad idea, as likely to make matters worse.

    Logarithms make most sense when your basic idea is that generating processes are essentially multiplicative, as with say exponential or power law relationships. Don't fall into regarding it as yet another bizarre ritual statistical manipulation on a par with reading tea leaves or haruspicy.
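
    For instance, a minimal sketch in Stata, using the shipped auto dataset; the multiplicative story for price and weight here is purely illustrative:

    Code:
    * If the generating process is multiplicative, say
    * price = A * weight^b * error, then taking logs of both sides
    * gives a model that is linear and additive in the parameters
    sysuse auto, clear
    generate lprice  = ln(price)     // logs need strictly positive values
    generate lweight = ln(weight)
    regress lprice lweight           // the slope estimates the elasticity b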



    • #3
      Thank you Nick. Can you recommend a text?



      • #4
        What's your field? If it is econ*, I note that Jeff Wooldridge has written a couple of texts.

        https://stats.stackexchange.com/ques...independent-va has many good specific points -- and rightly conveys some difference of views on quite how useful transformation is.




        • #5
          Everything that Nick says is reasonable enough to an econometrician as well, except probably the statement "There is, or should be, detailed discussion of this in your regression (economists say "econometrics") text, and if not you need a better text."

          It is just that in econometrics we do not tend to transform our variables that much; we transform a lot less than they do in statistics. Hence in most reputable texts on econometrics, including all by Professor Wooldridge, there is not much digging into how to transform your variables.

          If the conditional expectation E(y|X) = Xb is roughly linear in X, that pretty much completes the story for an econometrician.

          You transform your dependent variable if you are worried that the conditional expectation is not roughly linear. E.g., if you have a Cobb-Douglas production function y = A*L^a*K^b, this is very clearly nonlinear, and taking logs on both sides is necessary: ln(y) = ln(A) + a*ln(L) + b*ln(K) is linear in the parameters.

          As for a book where those things are discussed, I think this one has more on transformations:
          Hamilton, L. C. (1992). Regression with Graphics: A Second Course in Applied Statistics.

          I taught more or less from this book, and I remember teaching some "ladder of powers" and some Box-Cox transformations; I think they came from this book.
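
          For a flavour of both, a minimal sketch using Stata's official ladder and boxcox commands on the shipped auto dataset (an illustrative example, not one from Hamilton's book):

          Code:
          sysuse auto, clear
          * ladder tries the ladder of powers (cubic, square, identity,
          * square root, log, inverse, ...) and reports which
          * transformation of price looks closest to normal
          ladder price
          * boxcox estimates the Box-Cox transformation parameter for
          * the dependent variable by maximum likelihood
          boxcox price mpg headroom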





          • #6
            I am just looking at economics from the outside, modulo a grade A in Economics A level in 1969 (*). But I don't know: my friends who deal with income inequality tell me that thinking about log income is utterly natural; people talk about % changes in prices and wages and GDP and trade all the time, which are just multiplicative last I heard; and what about gravity models (+), returns, and so on?

            (*) British readers will appreciate that that was a spectacular achievement.

            (+) nothing to do with Isaac Newton, Albert Einstein or Stephen Hawking



            • #7
              Originally posted by Joro Kolev
              It is just that in econometrics we do not tend to transform our variables that much,
              That is not my impression. Both logarithms and squares seem to be very prevalent.
              ---------------------------------
              Maarten L. Buis
              University of Konstanz
              Department of history and sociology
              box 40
              78457 Konstanz
              Germany
              http://www.maartenbuis.nl
              ---------------------------------



              • #8
                Yes, and no.

                Yes in the sense that if you need to make the conditional expectation E(y|X) linear in X, you transform what you need to, in order to achieve linearity. Hence you might see lots of regressions of log(wage) on experience and experience^2.
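
                For example, a minimal sketch of that canonical specification, using Stata's shipped nlsw88 dataset with ttl_exp standing in for experience:

                Code:
                sysuse nlsw88, clear
                generate lwage = ln(wage)
                * log wage on a quadratic in experience: ## enters both
                * ttl_exp and its square into the regression
                regress lwage c.ttl_exp##c.ttl_exp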

                No in the sense that there is no attempt to make the dependent variable "close to normally distributed", so pretty much never anything like Box-Cox or ladder-of-powers transformations, which follow some statistical criterion of closeness to normality.

                So a linear conditional expectation is considered important; normality not so much, mostly because the samples of modern economics are large.

                Also, whatever transformations you have seen, I would guess that they were not statistically justified. Economists mostly follow what the previous literature has done in deciding whether to take the log of the wage and whether to use experience linearly or as a quadratic.

                Again, in the statistics course that I taught, I taught a lot of gazing at diagnostic plots. In econometrics there is not so much gazing at diagnostic plots. The latter relates to the point that if you want to know how to specify the relationship between wages and experience, you mostly look at the previous literature, rather than gazing at diagnostic plots to detect the nonlinearities.




                Originally posted by Maarten Buis

                That is not my impression. Both logarithms and squares seem to be very prevalent.



                • #9
                  My research is in the fields of public administration and nonprofit management.



                  • #10
                    What (kind of) texts would you normally use?



                    • #11
                      Originally posted by Joro Kolev
                      E.g., if you have a Cobb-Douglas production function y = A*L^a*K^b, this is very clearly nonlinear and taking logs on both sides is necessary.
                      Actually, it is neither necessary nor advisable to take logs on both sides, because we can estimate y = exp[c + a*ln(L) + b*ln(K)] by Poisson regression, which is consistent under very mild conditions and makes prediction very easy.
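
                      A minimal sketch of how easy that prediction step is, shown on Stata's shipped auto data:

                      Code:
                      sysuse auto, clear
                      poisson price mpg headroom, vce(robust)
                      * the n option returns exp(xb) = E(y|X) directly on the
                      * original scale, with no retransformation step needed
                      predict price_hat, n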

                      Best wishes,

                      Joao



                      • #12
                        I meant it is necessary if you want to do linear analysis.

                        If you want to do nonlinear analysis, the sky is the limit:

                        Code:
                        . sysuse auto, clear
                        (1978 Automobile Data)
                        
                        . nl (price = mpg^{a}*headroom^{b}*{c}), nolog robust
                        (obs = 74)
                        
                        
                        Nonlinear regression                                Number of obs =         74
                                                                            R-squared     =     0.8807
                                                                            Adj R-squared =     0.8757
                                                                            Root MSE      =   2406.972
                                                                            Res. dev.     =   1359.287
                        
                        ------------------------------------------------------------------------------
                                     |               Robust
                               price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                        -------------+----------------------------------------------------------------
                                  /a |  -1.282429   .1852144    -6.92   0.000    -1.651736   -.9131219
                                  /b |  -.1884905   .1360615    -1.39   0.170    -.4597895    .0828084
                                  /c |   344694.5   212073.9     1.63   0.109    -78168.87    767557.9
                        ------------------------------------------------------------------------------
                        
                        . poisson price mpg headroom, robust
                        
                        Iteration 0:   log pseudolikelihood = -31259.265  
                        Iteration 1:   log pseudolikelihood = -31259.202  
                        Iteration 2:   log pseudolikelihood = -31259.202  
                        
                        Poisson regression                              Number of obs     =         74
                                                                        Wald chi2(2)      =      21.28
                                                                        Prob > chi2       =     0.0000
                        Log pseudolikelihood = -31259.202               Pseudo R2         =     0.2894
                        
                        ------------------------------------------------------------------------------
                                     |               Robust
                               price |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                        -------------+----------------------------------------------------------------
                                 mpg |  -.0478065   .0107165    -4.46   0.000    -.0688104   -.0268026
                            headroom |  -.0634943   .0482614    -1.32   0.188    -.1580849    .0310964
                               _cons |   9.904525   .3278996    30.21   0.000     9.261853     10.5472
                        ------------------------------------------------------------------------------
                        What is advisable is not quite settled yet, I think. I am aware of the literature that started with Manning and Mullahy, and to which you contributed too, advising Poisson regression over linear log Y = Xb regression.

                        What I am not quite sure about is what we do in cases like the above: the -nl- I ran assumed an additive error, and the -poisson- assumed a multiplicative error. So to me the elephant in the room is which of those two, not so much Poisson vs. linear log Y = Xb regression.


                        Originally posted by Joao Santos Silva

                        Actually, it is neither necessary nor advisable to take logs on both sides, because we can estimate y = exp[c + a*ln(L) + b*ln(K)] by Poisson regression, which is consistent under very mild conditions and makes prediction very easy.

                        Best wishes,

                        Joao



                        • #13
                          Naturally, there are important differences between transforming variables and using a non-identity link function. Joao's approach for fitting power functions is a hybrid of the two, transforming predictors and using a link function for the outcome, and for excellent reasons, hinging on error structure and so forth.

                          I will blame Stas Kolenikov for the observation that economists seem especially fond of naming everything after two economists who didn't discover it but just used it a lot, such as Cobb and Douglas here. Geographers did invent maps, I suppose, which was a big deal, except that is a kind of retrospective disciplinary imperialism: whoever used maps we classify post hoc as a geographer, regardless.



                          • #14
                            Originally posted by Joro Kolev
                            What I am not quite sure about is what we do in cases like the above: the -nl- I ran assumed an additive error, and the -poisson- assumed a multiplicative error. So to me the elephant in the room is which of those two, not so much Poisson vs. linear log Y = Xb regression.
                            Dear Joro Kolev,

                            The point is not whether the error is multiplicative or additive, because that is immaterial in this context; the point is whether we assume the additive error to be homoskedastic or heteroskedastic. NLS assumes a homoskedastic additive error, and Poisson assumes a particular heteroskedasticity pattern; all the simulation evidence suggests that NLS is very poorly behaved in many reasonable contexts and that, in contrast, Poisson regression works reasonably well in a wide variety of settings. Hence my strong preference for Poisson regression.
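
                            A minimal sketch of that variance assumption made explicit, on the auto data from #12: Poisson regression is the same estimator as a GLM with a log link and the Poisson variance function, so the two commands below should give identical coefficients.

                            Code:
                            sysuse auto, clear
                            * poisson is a GLM with log link and Poisson variance,
                            * Var(y|X) proportional to E(y|X); vce(robust) keeps
                            * inference valid when that variance pattern is wrong
                            poisson price mpg headroom, vce(robust)
                            glm price mpg headroom, family(poisson) link(log) vce(robust)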

                            Best wishes,

                            Joao
