  • What to do with zero values on an independent variable in logistic regression analyses

    Hi All:

    Hoping that I can get some suggestions on the following:

    I am using a continuous income variable as one of my predictors of a binary dependent variable. Nearly 4 percent of my cases have a value of 0 for the income variable. I think that using logged income is the best approach. I am using the svy: logistic command with survey data.

    I have done some reading but am still not sure whether it is better to:

    1) Use a logged income variable that drops the 3.4 percent of cases as missing (N=981)

    or

    2) Change the value of income to $1 (or something else) and then do the log transformation. (N=1013)

    I have tried both in logistic regression analyses, keeping all other variables exactly the same, and get different results. With option #1, the odds ratio for the income variable is large and significant, while another critical variable that I'm more interested in is not significant. With option #2, the odds ratio for income is close to 1 and insignificant, but that same critical variable is significant.

    Any suggestions? I've read about creating a dummy variable for the cases where income is $0 and using it in conjunction with option #1, but haven't tried that yet.
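
    In case it helps clarify the question, here is roughly what I understand that dummy-variable variant to look like (just a sketch; y, x1, and x2 stand in for my actual dependent variable and other predictors, and svyset has already been declared):

    Code:
    * flag zero-income cases and give them an arbitrary logged value; the dummy absorbs it
    gen byte zero_inc = (income == 0) if !missing(income)
    gen ln_income = ln(income)                 // missing where income is 0 or missing
    replace ln_income = 0 if income == 0       // arbitrary filler for the flagged cases
    svy: logistic y ln_income zero_inc x1 x2
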
    Thanks so much!
    Edie



    Examples of what I've read:

    http://www.stata.com/statalist/archi.../msg00018.html

    http://www.stata.com/statalist/archi.../msg00629.html

  • #2
    Welcome to the board, Edie. Note that Statalist protocol is to use your real name. You can either ask the board administrators to change your id or add a signature to your posts.

    The add-a-dollar approach is generally frowned upon. For an independent variable, taking the cube root sometimes works well. I'm sure others can give you more advice.
    -------------------------------------------
    Richard Williams, Notre Dame Dept of Sociology
    StataNow Version: 19.5 MP (2 processor)

    EMAIL: [email protected]
    WWW: https://www3.nd.edu/~rwilliam



    • #3
      Richard gave very good advice. It's difficult to add to that, particularly without knowing anything about your data. At worst, zero really means missing rather than zero. There can't be a single correct identifiable way to proceed without knowing whether the zeros really are correct. I am a great fan of cube roots, but I suspect the problem is that for whatever you are doing log income is a good version of income; it's just that those zeros get in the way.



      • #4
        For highly skewed independent continuous variables - especially ones with zeroes - I usually categorize them. This avoids having to make any assumptions or transformations, and allows exceptional values (such as 0) to have their own category. Cutpoints can be chosen based on inspection of the distribution; if you think it's log normal you can even use cutpoints of 10^n, e.g.:

        Code:
        * income <= 0 -> 0, (0,100] -> 1, (100,1000] -> 2, (1000,10000] -> 3, (10000,100000] -> 4, above -> 5
        gen income_cat = irecode(income, 0, 100, 1000, 10000, 100000)
        Then use i.income_cat as your independent variable.
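
        With the original poster's survey setup, that might look something like the sketch below (y and the other covariates are placeholders, and svyset is assumed to have been declared already):

        Code:
        * categorized income entered as a set of indicators
        svy: logistic y i.income_cat x1 x2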



        • #5
          Hi, I remember having a good discussion with Marteen Buis about this in a post on the old Statalist mailing list, which can be found in the archive:

          http://www.stata.com/statalist/archi.../msg00788.html

          We disagreed on how to deal with zeros in explanatory variables, but I hope it will provide you with good information. Richard, Nick, could you please expound on the cube root method and what it does?

          Best,
          Alfonso Sanchez-Penalver



          • #6
            I remember discussing this in a post with Marteen Buis on the old Statalist mailing list, which is available in the archive at
            http://www.stata.com/statalist/archi.../msg00788.html
            We were discussing how to deal with zeros in explanatory variables, and each of us had his own opinion. Even though we didn't come to a consensus, it should illustrate some of the options available to you.

            Richard, Nick, could you please expound on the cube root method and what it's supposed to do?

            Jeph, in your method wouldn't you need to then spline the resulting categorical variable? After all you would want the predicted values to be the same at the cutpoints for the different categories, wouldn't you?
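
            For concreteness, what I have in mind is something like the sketch below, using mkspline with knots at the same cutpoints as in #4 (the outcome y and covariates x1, x2 are placeholders):

            Code:
            * linear spline of income with knots at 100, 1,000, and 10,000
            mkspline inc1 100 inc2 1000 inc3 10000 inc4 = income
            svy: logistic y inc1-inc4 x1 x2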

            Best,
            Alfonso Sanchez-Penalver



            • #7
              The main point of cube roots here is simply that the cube root of 0 is 0, so there is no doubt what to do with zeros. In the particular case of income, there are so many economic and statistical grounds to think that log income is a natural scale that cube roots are unlikely to help. In other cases, a real advantage of cube roots is that they are perfectly well defined for negative values, but again that is not obviously germane here.
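
              (As an aside on implementation: Stata's ^ operator returns missing when a negative number is raised to a fractional power, so a sign-preserving form is needed if negative values are present. A minimal sketch, with x standing in for whatever variable is at hand:)

              Code:
              * cube root that is defined for negative, zero, and positive values
              gen cube_x = sign(x) * abs(x)^(1/3)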

              A main point I recall from the previous thread was to underline that some people are tempted by log (variable + very small) as seemingly close to log variable, and as a seemingly conservative adjustment. Quite the contrary, it implies a drastic change to the data and just creates very large negative outliers. The point is clearer with log10 as log10(0 + 1e-6) = -6, log10(0 + 1e-9) = -9 and so forth.
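
              (The same arithmetic in Stata, for concreteness:)

              Code:
              display log10(0 + 1)        //  0
              display log10(0 + 1e-6)     // -6
              display log10(0 + 1e-9)     // -9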

              P.S. For "Marteen" read "Maarten", here and everywhere in the Stata world.



              • #8
                Originally posted by Nick Cox:
                P.S. For "Marteen" read "Maarten", here and everywhere in the Stata world.
                That is also true outside the Stata world.

                (Yes, I do have a life outside the Stata world. Right now it consists of my son filling his diaper while sitting on my lap. I will spare you the details like smell, colour, and consistency.)
                ---------------------------------
                Maarten L. Buis
                University of Konstanz
                Department of history and sociology
                box 40
                78457 Konstanz
                Germany
                http://www.maartenbuis.nl
                ---------------------------------



                • #9
                  Maarten: Too much information already!



                  • #10
                    Conceptually I would have a hard time explaining why the cube root makes sense. Empirically it often seems to work pretty well. It can handle 0 and negative values and helps deal with nonlinear effects. At least in this example, the logs of the variables and the cube roots correlate very highly:

                    Code:
                    . webuse nhanes2f, clear
                    
                    . gen lnweight = ln(weight)
                    
                    . gen cubeweight = weight ^ (1/3)
                    
                    . gen lnheight = ln(height)
                    
                    . gen cubeheight = height ^ (1/3)
                    
                    . corr weight lnweight cubeweight height lnheight cubeheight
                    (obs=10337)
                    
                                 |   weight lnweight cubewe~t   height lnheight cubehe~t
                    -------------+------------------------------------------------------
                          weight |   1.0000
                        lnweight |   0.9891   1.0000
                      cubeweight |   0.9951   0.9988   1.0000
                          height |   0.4777   0.4985   0.4931   1.0000
                        lnheight |   0.4752   0.4969   0.4912   0.9994   1.0000
                      cubeheight |   0.4761   0.4975   0.4919   0.9997   0.9999   1.0000
                    -------------------------------------------
                    Richard Williams, Notre Dame Dept of Sociology
                    StataNow Version: 19.5 MP (2 processor)

                    EMAIL: [email protected]
                    WWW: https://www3.nd.edu/~rwilliam



                    • #11
                      Thanks Nick and Richard. Sorry for the misspelling, Maarten.
                      Alfonso Sanchez-Penalver



                      • #12
                        Richard picked an example with heights for which no (standard) transformation would be noticeably nonlinear. The ratio of maximum to minimum height is 1.48, so even the correlation of height and log height will be very high, given that the logarithm function is close to linear over such a small range. Weights vary by a factor of 5.7, so even for that variable the variation is less than an order of magnitude.
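
                        (Those ratios are easy to verify in the same dataset; a quick sketch:)

                        Code:
                        * range ratios for height and weight in the nhanes2f data used in #10
                        webuse nhanes2f, clear
                        quietly summarize height
                        display r(max)/r(min)      // about 1.5
                        quietly summarize weight
                        display r(max)/r(min)      // about 5.7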
                        Last edited by Nick Cox; 13 May 2014, 11:59.



                        • #13
                          Thanks so much for the advice and the link to the archived discussion! I'll check it out. I'll try the categorical approach too.

                          Appreciate it.
                          -Eileen Diaz McConnell (someone who hates to have an internet presence, anywhere and at any time...)



                          • #14
                            Dear all,

                            I'd like to follow up on this discussion. I am running a model on count data and am therefore using Poisson-type regression. However, since the strict exogeneity assumption between one regressor and the dependent variable is likely to be violated, I am trying to use the pre-sample mean estimator suggested by Blundell et al. (2002). This basically boils down to inserting the log of the pre-sample mean of the dependent variable among the regressors. My problem is that several of the pre-sample values are equal to zero, so the pre-sample mean is zero for some observations and I cannot compute its log. Any advice on how to transform these data correctly so as to obtain the log of the pre-sample mean?
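
                            (To make the problem concrete, this is roughly how I construct the regressor; id, y, and presample are placeholder names for my panel identifier, the dependent count, and a pre-sample indicator:)

                            Code:
                            * pre-sample mean of the dependent count for each unit, then its log
                            bysort id: egen psm = mean(cond(presample == 1, y, .))
                            gen ln_psm = ln(psm)       // missing wherever the pre-sample mean is zero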

                            Thanks


                            Blundell, R., Griffith, R., and Windmeijer, F. (2002). Individual effects and dynamics in count data models. Journal of Econometrics 108, 113–131.



                            • #15
                              Hello guys,

                              I have a similar issue. In my dataset, some of the observations for my control variables (inventories and leverage, both of which can actually be 0 in real life) are equal to 0. Now I was wondering: if I run the following regression,

                              Code:
                              logit Y X inventory leverage
                              then there is (probably) an issue with the linearity between the continuous predictor variables (inventories and leverage) and the log odds.
                              How should I handle this? (Dropping these variables doesn't seem logical.)
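
                              (For what it's worth, one way I was thinking of eyeballing the linearity on the log-odds scale is a smoothed plot; just a sketch using the variable names above:)

                              Code:
                              * smoothed logit of Y against each control, to eyeball (non)linearity
                              lowess Y inventory, logit
                              lowess Y leverage, logit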

