
  • taking log of var turning 0 into missing value

    Hi everyone,

    I have an outcome variable, 'expenditure', which I believe may be non-linear in nature, so I am running a log-linear regression after taking the log of the variable. However, the values of total expenditure that are 0 are transformed into missing values (.), which of course makes sense, as you can't take the log of 0. I am just unsure what I should do in this situation. Should I just run the regression as though the values are missing even though they actually are not? Should I replace them with 0?

    After running the regression with the values as missing, the coefficient I got was highly significant, with a p-value = 0! However, I am not sure whether this should concern me: is it realistic to get a p-value equal to 0 when I am using real-world data with more than 10,000 households? I wonder if my regression is unreliable because of the missing values in my data (which shouldn't be missing values, as I mentioned earlier).

    TIA!
    Last edited by Nikita Shukla; 03 Mar 2022, 16:20.

  • #2
    You are right to be concerned, and you should not log-transform a variable that takes on zero values. Replacing the missing values with 0 is not appropriate either. While there are several approaches that people recommend (including some I dislike), I think the best solution here is to do Poisson regression on the expenditure variable, and use robust (or cluster-robust, if this is panel data) standard errors. It does not matter that the expenditure variable takes on non-integer values.
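
    A minimal sketch of this suggestion, run on simulated data because the original dataset is not shown. The variable names (expenditure, income, lnincome, hhid), the share of zeros, and the data-generating process are all assumptions made for illustration.

    ```stata
    * Simulated household data; all names and parameter values are hypothetical
    clear
    set seed 12345
    set obs 10000
    generate hhid = _n
    generate income = exp(rnormal(10, 0.5))
    generate lnincome = ln(income)
    * Roughly 15% exact zeros; positive, non-integer values otherwise
    generate expenditure = cond(runiform() < 0.15, 0, exp(-2 + 0.8*lnincome + rnormal(0, 0.5)))

    * Poisson regression with robust standard errors; the zeros stay in the sample
    poisson expenditure lnincome, vce(robust)

    * With panel data, cluster on the panel identifier instead, for example:
    * poisson expenditure lnincome, vce(cluster hhid)
    ```

    Because the regressor is logged, its coefficient can be read as the elasticity of mean expenditure with respect to income.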


    • #3
      I agree with Clyde Schechter. The main point about Poisson regression (or the same rose under any other name, say generalised linear models with logarithmic link) is that treating the mean function as positive -- and so capable of being mapped to logarithms and then back again -- is consistent with some values being zero (or even negative).

      The problem with (to use Stata syntax)

      cond(y == 0, 0, ln(y))

      is that it treats y == 0 as if it were y == 1, which is inconsistent (and also arbitrarily dependent on choice of units -- whether dollars, thousand dollars, or whatever else).
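
      The unit dependence can be seen with a few made-up numbers. This is only a sketch; the values and the variable names (y_dollars, y_thousands, and so on) are hypothetical.

      ```stata
      * Hypothetical expenditures in dollars, including one zero
      clear
      input y_dollars
      0
      500
      2000
      50000
      end
      generate y_thousands = y_dollars/1000

      * The ad hoc transform: zeros are set to 0, which is also ln(1)
      generate t_dollars   = cond(y_dollars == 0, 0, ln(y_dollars))
      generate t_thousands = cond(y_thousands == 0, 0, ln(y_thousands))

      * Positive values shift by ln(1000) when units change; the zero does not move
      generate shift = t_dollars - t_thousands
      list, noobs
      ```

      In dollars the zero is treated like an expenditure of 1 dollar, below every positive value. In thousands it is treated like 1000 dollars, so it sits above the household spending 500, whose transformed value ln(0.5) is negative.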

      For visualization I am quite sympathetic to the different idea of using log(y + 1) as making it possible to show zeros on a graph as well as pull in skewed distributions to make thinking easier, but for modelling that is rarely more than an awkward compromise. It can be a plus point that log(y + 1) behaves like y for small positive y and like log y for large positive y, as its series expansion makes clear. The more general idea of log(y + c) raises even more difficulties.
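
      The series expansions referred to above can be written out as follows (standard results, stated here for convenience):

      ```latex
      % Small positive y: log(1 + y) is close to y
      \log(1 + y) = y - \frac{y^2}{2} + \frac{y^3}{3} - \cdots , \qquad |y| < 1

      % Large positive y: log(1 + y) is close to log y
      \log(1 + y) = \log y + \log\!\left(1 + \frac{1}{y}\right)
                  = \log y + \frac{1}{y} - \frac{1}{2y^2} + \cdots , \qquad y > 1
      ```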


      • #4
        Cross-posted at https://stats.stackexchange.com/ques...th-values-of-0

        Please note our policy on cross-posting, which is that you are asked to tell us about it. https://www.statalist.org/forums/help#crossposting


        • #5
          I have seen many papers that did a probit on the zero-nonzero distinction and then a log-log regression on the nonzero observations. You could also consider a -tobit- regression.
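
          A rough sketch of both alternatives on simulated data. The variable names (x, lnx, anyspend, y, lny) and all parameter values are assumptions for illustration, not taken from any of the papers mentioned.

          ```stata
          * Simulated data; names and parameter values are hypothetical
          clear
          set seed 2022
          set obs 5000
          generate x = exp(rnormal())
          generate lnx = ln(x)
          generate byte anyspend = runiform() < normal(0.5 + 0.5*lnx)
          generate y = cond(anyspend, exp(1 + 0.3*lnx + rnormal()), 0)

          * Part 1: probit on the zero versus nonzero distinction
          probit anyspend lnx

          * Part 2: log-log regression on the nonzero observations only
          generate lny = ln(y) if y > 0
          regress lny lnx if y > 0

          * Alternative: tobit with left-censoring at zero
          tobit y lnx, ll(0)
          ```

          The two-part approach models whether any spending occurs separately from how much is spent, whereas the tobit treats zeros as censored values of a single latent variable.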


          • #6
            Thanks for the suggestions! And sorry about the cross-posting, I will keep it in mind next time! Just a quick question about using a Poisson regression: I read that it is useful for count data, so I just want to clarify whether I can still use it here, as expenditure is a continuous variable?


            • #7
              No doubt the approach mentioned by Daniel Feenberg is competitive for some kinds of thinking about the problem. If the variable is weekly expenditure on tobacco, my expenditure is a structural zero, and explaining why I and, more crucially, others behave similarly could be interesting and important. If the variable is weekly expenditure on alcohol, my expenditure might be a sampling zero, but there would be others the other way round.

              I suppose the issue comes down to one of methodology, whether the goal of modelling is to get closer to a handle on behavio[u]r as well as summarization of outcomes.


              • #8
                The question raised in #6 is fully understandable in light of how Poisson distributions/models are often taught.

                For purposes of modeling nonnegative but not necessarily integer outcomes, I suggest a new name, ZAPIT regression, where the ZAP stands for "zeros and positives."
                Code:
                zapit y x1 x2, vce(robust)
                It is the same as
                Code:
                poisson y x1 x2, vce(robust)
                except that it does not return the (in my opinion paternalistically annoying) note:
                Code:
                note: you are responsible for interpretation of noncount dep. variable.
                Old wine in new skins, yes, but perhaps a useful way to disconnect "Poisson" from "integer count."


                • #9
                  My take on #6 is (I hope) now fairly standard.

                  0. Bill Gould's blog posting remains a nice way into this. https://blog.stata.com/2011/08/22/us...tell-a-friend/

                  1. The name Poisson regression is just a name. It's pretty silly as a name as Poisson regression has nothing much to do with Poisson and it's arguable that any use of the Poisson distribution as a reference distribution is at best secondary -- it is not as if plain or vanilla regression is called Gaussian or normal regression (there is now Gaussian process regression -- set that on one side). But a name can stick and become a widely used handle, as here. At least Poisson is just two syllables and sufficient to remind us of some historical roots.

                  2. I'd suggest that the main idea is just to think y = exp(Xb). All else is detail and at choice. As said, to many people this idea has been standard since at least 1972 as generalized linear models with logarithmic link. Conversely, that is a bit of a mouthful.
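
                  As a sketch of that equivalence, the same model for the mean, E(y | X) = exp(Xb), can be fitted either way. The data are simulated, and the names (y, x1, x2, pois, glmlog) and parameter values are hypothetical.

                  ```stata
                  * Simulated data; names and parameter values are hypothetical
                  clear
                  set seed 1972
                  set obs 2000
                  generate x1 = rnormal()
                  generate x2 = rnormal()
                  * Nonnegative, non-integer outcome with some exact zeros
                  generate y = cond(runiform() < 0.1, 0, exp(0.5 + 0.4*x1 - 0.2*x2 + rnormal(0, 0.5)))

                  * Poisson regression with robust standard errors
                  poisson y x1 x2, vce(robust)
                  estimates store pois

                  * The same mean model as a generalized linear model with logarithmic link
                  glm y x1 x2, family(poisson) link(log) vce(robust)
                  estimates store glmlog

                  * Point estimates should agree up to numerical tolerance
                  estimates table pois glmlog, b(%9.4f) se(%9.4f)
                  ```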


                  • #10
                    John Mullahy "paternalistically": why not maternalistically?

                    (Purely personal, at one level, but perhaps shared too: my father had a temper, but if my mother told us or warned us off, she was almost always correct that we were wrong or taking a risk.)


                    • #11
                      Re: #10: Point well taken. "parentalistically" is better.
