  • Level of bias introduced with a log(x+1) transformation of outcome on my data - Multiple Linear Regression

    I'm attempting a multiple linear regression with calcium level as the outcome and ~10 predictor variables, one of which is continuous and the rest binary.
    On checking assumptions - normality is not met.
    My data contains 1611 observations with mean 128.95, SD 355.61 and maximum 4930. It contains 768 observations that equal 0. The minimum positive calcium level value is 1.
    A log transformation of the outcome fixes the normality issue, but obviously it does not work for the roughly half of the observations that equal 0. One of my main aims for the model is explanatory power and ease of understanding, as it is for medical purposes.

    I've attached the rvfplot and normality plot below.

    I understand there is a lot of controversy around using a log(x+1) transformation. Would it be appropriate to apply this transformation to my data, or are there better alternatives?

  • #2
    Have you considered something like a hurdle model or zero-inflated something-or-other model? Poisson regression (with robust standard errors) has been recommended for use in lieu of logarithmic transformation of outcome variables.

    So, maybe look into a zero-inflated Poisson model.
    Code:
    help zip

    • #3
      On checking assumptions - normality is not met.
      If you are doing a linear regression, unless you have a tiny sample size, testing normality is a waste of time. First, there is not, and never has been, any requirement that any explanatory variable be normally distributed. Nor has there ever been any requirement that the outcome variable be normally distributed--that is a widespread misunderstanding. There is a "requirement" that the residuals be normally distributed in order to get exact test statistics. However, linear regression is extremely robust to violations of that "requirement." And unless you are dealing with a very small sample size, the central limit theorem assures that the usual test statistics will have the sampling distributions that would be obtained with normal residuals, so your test statistics all work well anyway. Frankly, if your sample size is small enough that non-normality of residuals is an issue, there are no normality tests sufficiently powerful to help you distinguish normal from non-normal anyway. And, for what it's worth, even in a small sample with severely non-normal residuals, your regression coefficients are still unbiased estimates--it is only the standard errors, t/F-statistics, confidence intervals and p-values that are affected.
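
      As an illustration of that last point, here is a small hypothetical simulation sketch (not from the thread): with strongly skewed, non-normal errors and a sample of the size described (n = 1611), OLS still recovers the coefficients and the usual inference behaves well.
      Code:
      * Hypothetical simulation: heavily skewed (chi-squared) errors, n = 1611
      clear
      set seed 12345
      set obs 1611
      generate byte x = rbinomial(1, 0.5)
      generate double e = rchi2(1) - 1      // mean-zero but strongly right-skewed errors
      generate double y = 2 + 3*x + e
      regress y x                           // the estimated coefficient on x comes out near 3; with n = 1611 the usual t-test is reliable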

      What you should be focusing on is whether the relationship you are trying to model is truly linear. If your outcome is not actually linearly related to a combination of the explanatory variables, then any coefficients you get are those of a mis-specified model and they are not useful and may well be misleading. Since logarithm is a non-linear function, if the untransformed variable is, in fact, linearly related to the explanatory variables, then the log transform will not be (and vice versa). Now, if you were dealing with a rather narrow range of values of calcium, the logarithm function would be approximately linear--close enough that both the untransformed and transformed variables are sufficiently linearly related. But over the range from 1 to 4,930, the logarithm is strongly non-linear. And, of course, extending down to zero is not possible with logarithms. The trick of adding 1 before taking logarithms is an arbitrary kludge: why not add .00001 or 5x10^9? The results you get are likely to be sensitive to the choice of what you add. So you should only use a transformation if it linearizes the relationship.
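
      To see that sensitivity concretely, a quick hypothetical sketch (variable names assumed, not from the post):
      Code:
      * Hypothetical names: compare log(x + 1) with log(x + 0.001) as the outcome transformation
      generate double ln_ca_1    = ln(calcium + 1)
      generate double ln_ca_tiny = ln(calcium + 0.001)
      regress ln_ca_1    x1 x2 x3
      regress ln_ca_tiny x1 x2 x3
      * coefficients and fit can differ materially depending on the arbitrary added constant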

      Better still, as Joseph Coveney suggested in #2, use a different regression model that is better suited to this kind of data in the first place.
      Last edited by Clyde Schechter; 19 Jun 2023, 23:07.

      • #4
        Thank you both for your replies. There is a nonlinear relationship, Clyde, that seems to become linear with a logarithmic transformation (while not taking into account the excess of zeros in the data). And a clarification: it is my residuals that are not normally distributed (it is not my outcome or explanatory variables that I am concerned about). However, it seems from your suggestion that the CLT would prevent this from being an issue in a reasonably large sample like mine.
        Applying a log(x+1) transformation does seem to linearize the residual plot - noting that the added constant is a completely arbitrary choice.
        I will look into a zero-inflated Poisson model as an alternative for my data analysis.

        • #5
          An extra question I have about the zero-inflated Poisson model:
          From my brief reading, this model essentially splits into two parts: a logistic model for the 0's and a Poisson model for the non-zeros. It is useful when the data have an excess of 0's (such as mine, ~50%).
          I have no theoretical reason to believe that the causes of 0's vs. positive calcium values are different, i.e., the change from a calcium of 0 to 1 in theory is likely the same variable effect as 1 to 2 or 400 to 401.

          Is this model still appropriate to use? If so, can I have the same variables used for both the logistic model and the Poisson model?

          • #6
            From my brief reading, this model essentially splits into two parts: a logistic model for the 0's and a Poisson model for the non-zeros.
            Not exactly. The model assumes that the data generating process is a mixture model consisting of a subset of observations that necessarily have a zero value of calcium, and a complementary subset whose values have a Poisson distribution including the possibility of a zero value in the Poisson distribution. If the Poisson parameter is large, then, of course, there is a very low, but non-zero, probability of a zero value; if the Poisson parameter is small, then zero might be a very probable result. So it is not a logit model for zeros and a Poisson model for non-zeros. It is a logit model for mandatory zero, and a Poisson model (including some probability of zero) for the rest.
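
            In the thread's plain notation, that mixture implies, for an observation with covariates x (this is just the standard zero-inflated Poisson setup, written out for clarity):
            P(y = 0 | x) = pi(x) + [1 - pi(x)]*exp(-lambda(x))
            P(y = k | x) = [1 - pi(x)]*exp(-lambda(x))*lambda(x)^k / k!, for k = 1, 2, ...
            where pi(x) is the logit-modeled probability of being a "mandatory zero" and lambda(x) = exp(x*b) is the Poisson mean for the rest.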

            the change from a calcium of 0 to 1 in theory is likely the same variable effect as 1 to 2 or 400 to 401.
            Well, if that is the case, you are implying a linear model of calcium. But that is inconsistent with your statement in #4 that a log(1+x) transformation linearizes the relationship over a very wide range of values. So at this point I don't know what to suggest.

            can I have the same variables used for both the logistic model and the Poisson model?
            Yes. You can have any explanatory variables you like in either model. They can be the same, disjoint, overlapping, whatever. (I am answering this while taking no position on whether the zero-inflated Poisson is an appropriate model for your situation.)
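
            As a minimal sketch (hypothetical variable names, not from the post), the count-part covariates go in the main varlist and the inflation-part covariates in inflate(); they may be identical or differ:
            Code:
            * Hypothetical names: same covariates in the Poisson part and the zero-inflation (logit) part
            zip calcium x1 x2 x3, inflate(x1 x2 x3) vce(robust)
            * A different inflation equation is equally legal
            zip calcium x1 x2 x3, inflate(x1 x2) vce(robust)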

            • #7
              Originally posted by Robert Azzopardi
              Is this model still appropriate to use?
              I have to concur with Clyde's agnosticism on that.

              You say that this model is for medical purposes, but at heart a model needs to reflect the data-generating process regardless of your intended audience.

              What do you think gives rise to values distributed as you describe? I recommend allowing that to govern your choice of model.

              • #8
                You both make very valid points. Obviously it cannot be a linear relationship if a log(x+1) transformation resolves the nonlinearity to a degree.
                My familiarity with models other than linear and logistic regression is very limited (I am halfway through my biostatistics course).

                I believe a zero-inflated Poisson model is the way to go (but I am not familiar with the alternatives nor with the model itself). Understanding the Stata output for the Poisson model is similarly unfamiliar to me. I will look into it.

                Nonetheless, thank you both for your replies; they have been very helpful.

                • #9
                  I'd start with an exponential model for the mean of y, including the zeros, estimated using the Poisson quasi-MLE. In other words,

                  Code:
                  poisson y x1 ... xk, vce(robust)
                  The coefficients are easy to interpret as proportionate effects. It doesn't add extra parameters like a zero-inflated Poisson model. Plus, as soon as you use the Poisson ZI model, you now have to take the Poisson distribution seriously. If you just model E(y|x) as an exponential function and use the Poisson log likelihood, the estimates are consistent no matter what.
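
                  As a hypothetical sketch of that "proportionate effects" interpretation (variable names assumed, not from the post): under the exponential mean, a coefficient b corresponds to a 100*(exp(b) - 1) percent change in E(y|x) for a one-unit change in that predictor.
                  Code:
                  * Hypothetical variable names, not from the post
                  poisson calcium x1 x2 x3, vce(robust)
                  * Redisplay the results as incidence-rate ratios, i.e., exp(b)
                  poisson, irr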

                  Is y actually a count, or does it take non-integer values? It doesn't matter for Poisson regression, but it might determine which two-part model you use. You could use a logit to determine y = 0 versus y > 0, and then use Poisson regression for the y > 0 part. Or gamma regression (using the glm command). Then you wouldn't be making distributional assumptions. For E(y|y > 0,x), use

                  Code:
                  glm y x1 ... xk if y > 0, family(gamma) link(log) vce(robust)
                  Then E(y|x) = P(y > 0|x)*exp(x*b).
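
                  A minimal sketch of how that combined prediction could be computed, with hypothetical variable names (y as the outcome, x1-x3 as predictors):
                  Code:
                  * Hypothetical names; two-part model: logit for y > 0, gamma GLM (log link) for y given y > 0
                  generate byte pos = (y > 0)
                  logit pos x1 x2 x3
                  predict double p_pos, pr               // P(y > 0 | x)
                  glm y x1 x2 x3 if y > 0, family(gamma) link(log) vce(robust)
                  predict double mu_pos, mu              // E(y | y > 0, x) = exp(x*b)
                  generate double ey = p_pos * mu_pos    // E(y | x)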

                  • #10
                    Thanks Jeff, y does not take non-integer values and can be investigated as a count (although in reality it is a continuous variable; the recorded results are discrete, however). Just for my own learning, would there not be concern that the data are not well represented by the Poisson quasi-MLE? For instance, in my data there are ~50% 0's and the rest of the values range up to about 4000. Would you not be concerned that the data are heavily skewed, with a mean from the Poisson model that is not representative of the effect seen on the y > 0 side of the model? Or is there something I'm missing?

                    • #11
                      When I fit a Poisson regression model either on the data as a whole or only on the positive data, and then run an estat gof test using the code below,

                      Code:
                      poisson y x1 ... xk, vce(robust)
                      estat gof
                      poisson y x1 ... xk if y > 0, vce(robust)
                      estat gof
                      I find that both goodness-of-fit tests are significant, suggesting the Poisson distribution does not describe my data well. I assume this is concerning and that I should look into my data to find a better-fitting model?

                      In my mind, one of the more obvious ways to model the data would be a two-part model with, as you have said, y = 0 vs y > 0, and then a linear regression with a log transformation of y (as this linearizes the residuals). Is this approach valid?
                      Last edited by Robert Azzopardi; 21 Jun 2023, 17:48.

                      • #12
                        Graph.gph

                        Essentially my data looks like this. Any ideas on the best way to model it?

                        • #13
                          Originally posted by Jeff Wooldridge
                          . . . as soon as you use the Poisson ZI model, you now have to take the Poisson distribution seriously.

                          . . . You could use a logit to determine y = 0 versus y > 0, and then use Poisson regression for the y > 0 part. . . . Then you wouldn't be making distributional assumptions.
                          Thank you, I wasn't aware of that.

                          OP mentions "medical purpose", so perhaps the dataset comes from some calcium fluorescence assay in cytometry or algorithmic detection of calcification in medical imaging (number of pixels with intensity above some multiple of background noise in a region of interest drawn over a solid tumor or a stenotic artery segment). Regardless, I’m guessing that zeroes in the data aren’t really zero, but rather reflect some measurement threshold. Would the measurement-error aspect of the data-generating process affect the choice between the zero-inflated and two-stage approaches?

                          The question of how best to accommodate values below the limit of detection (quantitation) has come up in other biological fields, for example, pharmacokinetics.

                          Here with OP's dataset, the two-stage approach (gamma distribution family and log link function) and a zero-inflated negative-binomial model happen to be nearly identical in performance. But would there be a reason for preferring one over the other if 0 isn't qualitatively so different from 1 (or 2 or . . .)?
                          Code:
                          version 18.0
                          
                          clear *
                          
                          copy "https://www.statalist.org/forums/filedata/fetch?id=1717717" ///
                              `c(pwd)'/rvf.gph
                          
                          graph use rvf, nodraw
                          
                          serset use
                          
                          quietly generate int out = round(__000000 + __000001)
                          assert mi(__000000) & mi(__000001) if mi(out)
                          quietly drop if mi(out)
                          
                          egen double rnk = rank(__000001)
                          logit out c.rnk, nolog
                          predict double pr1, pr
                          
                          foreach dist in poisson gaussian nbinomial {
                              quietly glm out c.rnk if out > 0, family(`dist') link(log)
                              display in smcl as text _newline(1) " `dist': AIC = " as result %5.1f e(aic)
                          }
                          
                          glm out c.rnk if out > 0, family(gamma) link(log) nolog
                          predict double mu, mu
                          generate double exo = pr1 * mu // "E(y|x) = P(y = 1|x)*exp(x*b)"
                          
                          lowess out rnk, mcolor(black) msize(tiny) lineopts(lcolor(blue)) ///
                              title("") note("") ytitle(Calcium Something) ///
                                  xtitle(Rank of Former Linear Predictor) ///
                              addplot(line exo rnk, sort lcolor(red)) ///
                                  scheme(s2color) ylabel( , angle(horizontal) nogrid) legend(off)
                          sleep 5000
                          
                          zinb out c.rnk, inflate(c.rnk) nolog
                          predict double n, n
                          graph twoway ///
                              line exo n, sort lcolor(red) || ///
                              line n n, lcolor(blue) ///
                              xtitle("ZINB Prediction") ytitle("Logit | Gamma GLM Prediction") ///
                                  scheme(s2color) ylabel( , angle(horizontal) nogrid) legend(off)
                              
                          erase `c(pwd)'/rvf.gph
                          
                          exit
                          Last edited by Joseph Coveney; 21 Jun 2023, 23:36.

                          • #14
                            Robert: It's very important to understand that Poisson regression is not assuming a Poisson distribution. So the estat gof outcome is irrelevant. All that matters is that E(y|x) is well approximated by an exponential function. You can do a version of RESET to test that, but I wouldn't worry about modeling every little nonlinearity.
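
                            A minimal sketch of one such RESET-style check (hypothetical variable names, not from the post): add powers of the fitted index to the Poisson regression and test them jointly.
                            Code:
                            * Hypothetical names; RESET-style check of the exponential conditional mean
                            poisson y x1 x2 x3, vce(robust)
                            predict double xbhat, xb                 // fitted linear index x*b
                            generate double xbhat2 = xbhat^2
                            generate double xbhat3 = xbhat^3
                            poisson y x1 x2 x3 xbhat2 xbhat3, vce(robust)
                            test xbhat2 xbhat3                       // joint significance suggests the exponential mean is misspecified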

                            Joseph has provided two good options. I'm not too surprised they give similar answers. If y is not a count, I'd probably lean toward the two-part model with a gamma quasi-likelihood in the second step. It only requires E(y|x) to have an exponential form. The gamma distribution can be arbitrarily wrong.

                            Because of the different robustness features of the different methods, AIC is not what I would use to choose. Only if you're actually trying to model the entire distribution does using AIC across different methods based on the log likelihood generally make sense.
