  • What to do with zero values on an independent variable in logistic regression analyses

    Hi All:

    Hoping that I can get some suggestions on the following:

    I am using a continuous income variable as one of my predictors of a binary dependent variable. Nearly 4 percent of my cases have a value of 0 for the income variable. I think that using logged income is the best approach. I am using the svy: logistic command with survey data.

    I have done some reading but am still not sure whether it is better to:

    1) Use a logged income variable that drops the 3.4 percent of cases as missing (N=981)

    or

    2) Change the value of income to $1 (or something else) and then do the log transformation. (N=1013)

    I have tried both in logistic regression analyses, keeping all other variables exactly the same, and get different results. With option #1, the odds ratio for the income variable is large and significant, while another critical variable that I'm more interested in is not significant. With option #2, the odds ratio for income is close to 1 and insignificant, but that same critical variable is significant.

    Any suggestions? I've read about creating a dummy variable for the cases where income is $0 and using it in conjunction with option #1, but haven't tried that yet.
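
    In case it helps clarify the question, here is roughly what I understand that dummy-variable variant to look like (just a sketch; y, x1, and x2 stand in for my actual dependent variable and other predictors, and svyset has already been declared):

    Code:
    * flag zero-income cases and give them an arbitrary logged value; the dummy absorbs it
    gen byte zero_inc = (income == 0) if !missing(income)
    gen ln_income = ln(income)                 // missing where income is 0 or missing
    replace ln_income = 0 if income == 0       // arbitrary filler for the flagged cases
    svy: logistic y ln_income zero_inc x1 x2
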
    Thanks so much!
    Edie



    Examples of what I've read:

    http://www.stata.com/statalist/archi.../msg00018.html

    http://www.stata.com/statalist/archi.../msg00629.html

  • #2
    Welcome to the board, Edie. Note that Statalist protocol is to use your real name. You can either ask the board administrators to change your id or add a signature to your posts.

    The add-a-dollar approach is generally frowned upon. For an independent variable, taking the cube root sometimes works well. I'm sure others can give you more advice.
    -------------------------------------------
    Richard Williams, Notre Dame Dept of Sociology
    StataNow Version: 19.5 MP (2 processor)

    EMAIL: [email protected]
    WWW: https://www3.nd.edu/~rwilliam



    • #3
      Richard gave very good advice. It's difficult to add to that, particularly without knowing anything about your data. At worst, zero really means missing rather than zero. There can't be a single correct identifiable way to proceed without knowing whether the zeros really are correct. I am a great fan of cube roots, but I suspect the problem is that for whatever you are doing log income is a good version of income; it's just that those zeros get in the way.



      • #4
        For highly skewed independent continuous variables - especially ones with zeroes - I usually categorize them. This avoids having to make any assumptions or transformations, and allows exceptional values (such as 0) to have their own category. Cutpoints can be chosen based on inspection of the distribution; if you think it's log normal you can even use cutpoints of 10^n, e.g.:

        Code:
        * income <= 0 -> 0, (0,100] -> 1, (100,1000] -> 2, (1000,10000] -> 3, (10000,100000] -> 4, above -> 5
        gen income_cat = irecode(income, 0, 100, 1000, 10000, 100000)
        Then use i.income_cat as your independent variable.
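
        With the original poster's survey setup, that might look something like the sketch below (y and the other covariates are placeholders, and svyset is assumed to have been declared already):

        Code:
        * categorized income entered as a set of indicators
        svy: logistic y i.income_cat x1 x2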



        • #5
          Hi, I remember having a good discussion with Marteen Buis about this in a post on the old Statalist mailing list, which can be found in the archive:

          http://www.stata.com/statalist/archi.../msg00788.html

          We disagreed on how to deal with zeros in explanatory variables, but I hope it will provide you with good information. Richard, Nick, could you please expound on the cube root method and what it does?

          Best,
          Alfonso Sanchez-Penalver



          • #6
            I remember discussing this in a post with Marteen Buis on the old Statalist mailing list, which is available in the archive at
            http://www.stata.com/statalist/archi.../msg00788.html
            We were discussing how to deal with zeros in explanatory variables, and each of us had his own opinion. Even though we didn't come to a consensus, it should illustrate some of the options available to you.

            Richard, Nick, could you please expound on the cube root method and what it's supposed to do?

            Jeph, in your method wouldn't you need to then spline the resulting categorical variable? After all you would want the predicted values to be the same at the cutpoints for the different categories, wouldn't you?
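
            For concreteness, what I have in mind is something like the sketch below, using mkspline with knots at the same cutpoints as in #4 (the outcome y and covariates x1, x2 are placeholders):

            Code:
            * linear spline of income with knots at 100, 1,000, and 10,000
            mkspline inc1 100 inc2 1000 inc3 10000 inc4 = income
            svy: logistic y inc1-inc4 x1 x2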

            Best,
            Alfonso Sanchez-Penalver



            • #7
              The main point of cube roots here is simply that the cube root of 0 is 0, so there is no doubt what to do with zeros. In the particular case of income, there are so many economic and statistical grounds to think that log income is a natural scale that cube roots are unlikely to help. In other cases, a real advantage of cube roots is that they are perfectly well defined for negative values, but again that is not obviously germane here.
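
              (As an aside on implementation: Stata's ^ operator returns missing when a negative number is raised to a fractional power, so a sign-preserving form is needed if negative values are present. A minimal sketch, with x standing in for whatever variable is at hand:)

              Code:
              * cube root that is defined for negative, zero, and positive values
              gen cube_x = sign(x) * abs(x)^(1/3)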

              A main point I recall from the previous thread was to underline that some people are tempted by log (variable + very small) as seemingly close to log variable, and as a seemingly conservative adjustment. Quite the contrary, it implies a drastic change to the data and just creates very large negative outliers. The point is clearer with log10 as log10(0 + 1e-6) = -6, log10(0 + 1e-9) = -9 and so forth.
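
              (The same arithmetic in Stata, for concreteness:)

              Code:
              display log10(0 + 1)        //  0
              display log10(0 + 1e-6)     // -6
              display log10(0 + 1e-9)     // -9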

              P.S. For "Marteen" read "Maarten", here and everywhere in the Stata world.



              • #8
                Originally posted by Nick Cox:
                P.S. For "Marteen" read "Maarten", here and everywhere in the Stata world.
                That is also true outside the Stata world.

                (Yes, I do have a life outside the Stata world. Right now it consists of my son filling his diaper while sitting on my lap. I will spare you the details like smell, colour, and consistency.)
                ---------------------------------
                Maarten L. Buis
                University of Konstanz
                Department of history and sociology
                box 40
                78457 Konstanz
                Germany
                http://www.maartenbuis.nl
                ---------------------------------



                • #9
                  Maarten: Too much information already!



                  • #10
                    Conceptually I would have a hard time explaining why the cube root makes sense. Empirically it often seems to work pretty well. It can handle 0 and negative values and helps deal with nonlinear effects. At least in this example, the logs of the variables and the cube roots correlate very highly:

                    Code:
                    . webuse nhanes2f, clear
                    
                    . gen lnweight = ln(weight)
                    
                    . gen cubeweight = weight ^ (1/3)
                    
                    . gen lnheight = ln(height)
                    
                    . gen cubeheight = height ^ (1/3)
                    
                    . corr weight lnweight cubeweight height lnheight cubeheight
                    (obs=10337)
                    
                                 |   weight lnweight cubewe~t   height lnheight cubehe~t
                    -------------+------------------------------------------------------
                          weight |   1.0000
                        lnweight |   0.9891   1.0000
                      cubeweight |   0.9951   0.9988   1.0000
                          height |   0.4777   0.4985   0.4931   1.0000
                        lnheight |   0.4752   0.4969   0.4912   0.9994   1.0000
                      cubeheight |   0.4761   0.4975   0.4919   0.9997   0.9999   1.0000
                    -------------------------------------------
                    Richard Williams, Notre Dame Dept of Sociology
                    StataNow Version: 19.5 MP (2 processor)

                    EMAIL: [email protected]
                    WWW: https://www3.nd.edu/~rwilliam



                    • #11
                      Thanks Nick and Richard. Sorry for the misspelling, Maarten.
                      Alfonso Sanchez-Penalver



                      • #12
                        Richard picked an example with heights for which no (standard) transformation would be noticeably nonlinear. The ratio of maximum to minimum height is 1.48, so even the correlation of height and log height will be very high, given that the logarithm function is close to linear over such a small range. Weights vary by a factor of 5.7, so even for that variable the variation is less than an order of magnitude.
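
                        (Those ratios are easy to verify in the same dataset; a quick sketch:)

                        Code:
                        * range ratios for height and weight in the nhanes2f data used in #10
                        webuse nhanes2f, clear
                        quietly summarize height
                        display r(max)/r(min)      // about 1.5
                        quietly summarize weight
                        display r(max)/r(min)      // about 5.7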
                        Last edited by Nick Cox; 13 May 2014, 11:59.



                        • #13
                          Thanks so much for the advice and the link to the archived discussion! I'll check it out. I'll try the categorical approach too.

                          Appreciate it.
                          -Eileen Diaz McConnell (someone who hates to have an internet presence, anywhere and at any time...)



                          • #14
                            Dear all,

                            I'd like to follow up on this discussion. I am running a model on count data and am therefore using Poisson-type regression. However, since the strict exogeneity assumption between one regressor and the dependent variable is likely to be violated, I am trying to use the pre-sample mean estimator suggested by Blundell et al. (2002). This basically boils down to inserting the log of the pre-sample mean of the dependent variable among the regressors. My problem is that several of the pre-sample values are equal to zero, so the pre-sample mean is zero for some observations and I cannot compute its log. Any advice on how to transform these data correctly so as to obtain the log of the pre-sample mean?
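
                            (To make the problem concrete, this is roughly how I construct the regressor; id, y, and presample are placeholder names for my panel identifier, the dependent count, and a pre-sample indicator:)

                            Code:
                            * pre-sample mean of the dependent count for each unit, then its log
                            bysort id: egen psm = mean(cond(presample == 1, y, .))
                            gen ln_psm = ln(psm)       // missing wherever the pre-sample mean is zero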

                            Thanks


                            Blundell, R., Griffith, R., and Windmeijer, F. (2002). Individual effects and dynamics in count data models. Journal of Econometrics 108, 113–131.



                            • #15
                              Hello guys,

                              I have a similar issue. In my dataset, some of the observations for my control variables (inventories and leverage, both of which can actually be 0 in real life) are equal to 0. Now I was wondering: if I run the following regression,

                              Code:
                              logit Y X inventory leverage
                              then there is (probably) an issue with the linearity between the continuous predictor variables (inventories and leverage) and the log odds.
                              How should I handle this? (Dropping these variables doesn't seem logical.)
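
                              (For what it's worth, one way I was thinking of eyeballing the linearity on the log-odds scale is a smoothed plot; just a sketch using the variable names above:)

                              Code:
                              * smoothed logit of Y against each control, to eyeball (non)linearity
                              lowess Y inventory, logit
                              lowess Y leverage, logit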

