
  • log transformation of a ratio variable with small values

    I log-transformed a ratio variable (sem), which ranged from 5.73e-10 to .0001021 (mean: 1.72e-06, s.d.: 2.15e-06).

    Variable | Obs Mean Std. Dev. Min Max
    -------------+--------------------------------------------------------
    sem | 9176 1.72e-06 2.15e-06 5.73e-10 .0001021

    To prevent negative numbers after log transformation, I added 1 to the original variable like below.
    gen lsem=log(sem+1)

    But the log transformation appears to have failed, as shown below: all values remained unchanged.

    Variable | Obs Mean Std. Dev. Min Max
    -------------+--------------------------------------------------------
    sem | 9176 1.72e-06 2.15e-06 5.73e-10 .0001021
    lsem | 9176 1.72e-06 2.15e-06 5.73e-10 .0001021

    What was wrong? How can I log-transform this variable?



  • #2
    If you expand log(1+x) in a Taylor series you get x - x^2/2 + x^3/3 - x^4/4 ... Because the numbers you are using are all smaller (most of them a great deal smaller) than 10^-3, the quadratic and higher order terms are all smaller than 10^-6. So, to at least 6 decimal places, log(1+x) is the same as x. In fact, this approximation is often exploited to speed up calculations or simplify equations in many contexts.
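
    A quick numerical check of this approximation, as a minimal sketch: the three values are the minimum, mean, and maximum reported in #1, and the variable names x, lx, and gap are made up for illustration.

    ```stata
    clear
    set obs 3
    * minimum, mean, and maximum of sem as reported in #1
    generate double x = 5.73e-10 in 1
    replace x = 1.72e-06 in 2
    replace x = .0001021 in 3
    generate double lx = log(1 + x)
    generate double gap = x - lx
    format x lx gap %12.0g
    list x lx gap
    * the largest gap should be near x^2/2, about 5e-9 for the maximum
    assert abs(gap) < 1e-8
    ```

    Because summarize prints only about seven significant digits, sem and log(sem+1) can look identical in its output even though they differ slightly.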

    May I ask why you want to eliminate negative values for log(x)? How will that be a problem for you?



    • #3
      Schechter, thank you very much for your kind answer. Two reasons made me do this. One was that the initial log-transformation resulted in all negative values. Another was that I log-transformed this variable in order to make interaction terms with other variables, and positive values seemed better for interpreting interaction effects. Specifically, before adding 1 (i.e., log(sem+1)), I log-transformed as below.

      gen lsem=log(sem)
      sum lsem
      Variable | Obs Mean Std. Dev. Min Max
      -------------+--------------------------------------------------------
      lsem | 9176 -13.87985 1.417644 -21.27972 -9.189582

      The problem was that all values were negative although the original values are all positive. Is it better to use this? It seemed reasonable and the result showed a normal distribution (attached). But, I am hesitating to use it because of the above reasons.

      Last edited by Sang-Bum Park; 16 Feb 2019, 16:54.



      • #4
        Well, yes, the logarithms will all be negative because the original values are all less than 1. But that is not a problem at all.

        I wouldn't describe the distribution of log(x) that you show as normal. It's probably closer to normal looking than that of x itself. But that then raises another question: why do you want this variable to have a normal distribution.

        There is a widespread erroneous belief that variables need to have normal distributions to use them in linear regressions. It is still even widely taught today, although it is completely wrong. On the outcome variable side, the most that one can say of this nature is that the residuals, not the outcome variable itself, need to be normally distributed in order for the p-values and t-tests to work properly in a small sample. But if your sample has 9,176 observations, the central limit theorem will rescue the t-tests and p-values from all but the most extreme violations of the normality of residuals. And on the predictor variable side there are no distributional requirements of any kind at all.

        So really, the only reason you should be thinking about log-transforming this variable is if there is reason to believe that log x is linearly related to your outcome variable but x itself is not (or, if x is the outcome variable, that log x is linearly related to the predictor(s) but x itself is not).



        • #5
          I am very grateful for your excellent and very helpful answer. I understand that it is normal for values smaller than 1 to become negative after log-transforming, and that it is the residuals of the dependent variable that need to be normally distributed, a requirement that matters less in a large sample. Okay, I will use the original values without a log-transformation. I don't know whether the results differ with and without log-transforming. Thank you!



          • #6
            You haven't addressed whether the relationship (conditional on the other predictors) is more nearly linear after transformation. With a range of values over more than 5 orders of magnitude, from 5.73e-10 to .0001021 on the original scale, I would not be surprised at outliers that exert considerable leverage. Added variable plots should help you decide.
            Last edited by Nick Cox; 17 Feb 2019, 09:33.
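
            The added variable plots suggested above could be produced along these lines. This is only a sketch: y, z1, and z2 are hypothetical names standing in for the outcome and the other predictors, which the thread never identifies, and ln_sem is a made-up name for the logged predictor.

            ```stata
            * y, z1, z2 are hypothetical; substitute the actual outcome and covariates
            generate double ln_sem = log(sem)

            * added variable plot for sem on the original scale
            regress y sem z1 z2
            avplot sem, name(av_raw, replace)

            * added variable plot for the log-transformed predictor
            regress y ln_sem z1 z2
            avplot ln_sem, name(av_log, replace)
            ```

            Comparing the two graphs side by side shows whether a few extreme points dominate the fit on the original scale and whether the logged version looks closer to a straight line.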



            • #7
              Excellent comments by Clyde Schechter and Nick Cox, as usual. I just want to add that nothing has been said about what model is being estimated, or whether the OP is just using OLS for this. Now, one thing he may want to consider is that sem seems like a fractional response variable, since all of its values are between 0 and 1. Although somewhat appropriate if you want to do predictions at the means, OLS is not appropriate if you're trying to do predictions at other values, particularly far away from the mean, because there is nothing stopping it from predicting a value less than 0 or greater than 1. An alternative is to use glm with a binomial family and a logit link, as indicated in Papke and Wooldridge (1993), and Papke and Wooldridge (1996). I cite both, because the first one extends the second in covering cases where we need to work with the total cases in each proportion.
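
              A minimal sketch of the glm specification mentioned above, applicable only when the fractional variable is the outcome. The names frac_y, x1, and x2 are hypothetical, and the robust variance option is an assumption on my part rather than something stated in this post.

              ```stata
              * frac_y, x1, x2 are hypothetical names; frac_y must lie in [0, 1]
              glm frac_y x1 x2, family(binomial) link(logit) vce(robust)
              ```

              In Stata 14 or later, fracreg logit fits the same kind of model.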

              References:
              Papke, Leslie E. and Jeffrey M. Wooldridge. 1993. Econometric Methods for Fractional Response Variables with an Application to 401(K) Plan Participation Rates. NBER Technical Working Paper 147.

              Papke, Leslie E. and Jeffrey M. Wooldridge. 1996. Econometric Methods for Fractional Response Variables with an Application to 401(K) Plan Participation Rates. Journal of Applied Econometrics 11(6): 619-632.
              Alfonso Sanchez-Penalver



              • #8
                Alfonso Sánchez-Peñalver I read #3 as implying that this is a predictor. If not, it could hardly appear in interaction terms.



                • #9
                  Nick Cox oh you are right, but if that's the case why do we care if it is normally distributed? The only reason to log-transform it would be if the relationship with the explained variable was monotonically nonlinear. Multiplying the variable by 100 to have it in percent form could be an option there, but for values that are very small they will still remain negative when log-transformed.

                  The discussion about the distribution of the variable is what made me think that it was the explained variable. Sorry about that.
                  Alfonso Sanchez-Penalver



                  • #10
                    It doesn't have to be normally distributed. But marked skewness on the original scale -- which is established by #3 -- is often in practice associated with difficulties in establishing a linear relationship. Like you, I would underline that curvature is often reduced by taking logarithms.

                    If the OP wants products of this and other variables, the problems could be compounded.

                    Normality isn't a goal, but saying that is no denial that approximate symmetry of distributions often makes analyses easier, if only for the simpler behaviour of relationships they often imply.

                    I assume that in #1 the OP meant a ratio calculated from some numerator/some denominator. Log scale is often a natural choice for ratios given log ratio = log numerator - log denominator



                    • #11
                      Hi Nick, I agree with what you're saying, but I was just saying that it is not necessary particularly if the relationship really seems to be linear.

                      If you have the values of the numerator and the denominator... why not enter each separately log-transformed? That way you're not restricting the coefficient on one of the variables being the exact same value as the one in the other variable but with a negative sign, thus allowing you to test whether the log of the proportion is appropriate or you may have a misspecification. Something that can be tested with a simple t-test. It also allows the interaction of both variables to have a different effect. Of course if you don't have the two variables to do this it won't be possible.
                      Alfonso Sanchez-Penalver
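
                      The suggestion above might look like this in practice. It is a sketch under assumptions: y, z1, and z2 are hypothetical names for the outcome and other predictors, and num and den stand for the numerator and denominator of the ratio.

                      ```stata
                      * y, z1, z2 are hypothetical names; num and den are the ratio's components
                      generate double lnum = log(num)
                      generate double lden = log(den)
                      regress y lnum lden z1 z2
                      * H0: the coefficients are equal in size with opposite signs,
                      * which is what entering log(num/den) as a single predictor imposes
                      test lnum + lden = 0
                      ```

                      With a single restriction, the F statistic reported by test is the square of the corresponding t statistic, so this is the t-test mentioned above in another form.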



                      • #12
                        I am very grateful for your (Clyde Schechter, Nick Cox, and Alfonso Sánchez-Peñalver) excellent answers. For clarity, I detail the numerator and denominator of my variable (sem). Although this dataset differs from the one above (obs. 9176), the variables are the same. This is a part of the (real) dataset that I am working on. Summary statistics for sem are as follows:

                        Variable | Obs Mean Std. Dev. Min Max
                        -------------+--------------------------------------------------------
                        sem | 1316 .0010682 .0010715 5.73e-07 .0081791

                        sem is a ratio variable and the denominator (den) is very large while the numerator (num) is (relatively)small.

                        Variable | Obs Mean Std. Dev. Min Max
                        -------------+--------------------------------------------------------
                        num | 1357 7588.634 12467.37 11 101970
                        den | 1333 1.28e+07 2.11e+07 261737.8 2.29e+08

                        If sem is log-transformed as a ratio (log(sem)), then the log variable (lsem) is as follows:

                        Variable | Obs Mean Std. Dev. Min Max
                        -------------+--------------------------------------------------------
                        lsem | 1316 -7.546429 1.6295 -14.37197 -4.806176

                        As Clyde Schechter explained, all transformed log values were negative because the ratio is smaller than 1. He suggested that I use it. I call this option 1.

                        If I use log-transformed num and log-transformed den (log(num)/log(den)), then the result (lsem2) is as follows:

                        Variable | Obs Mean Std. Dev. Min Max
                        -------------+--------------------------------------------------------
                        lsem2 | 1316 .5186528 .0909031 .1746379 .6448187

                        This follows Nick Cox's comment. I call this option 2.

                        The correlation between lsem and lsem2 is very high (0.94) and significant (p<0.001).

                        In addition, I want to raise another issue, that is, a high correlation between num and den (0.76, p<0.001).

                        My questions are twofold.
                        First, can I use option 2 instead of option 1? Or should I follow option 1?
                        Second, is there any problem in using a ratio with a high correlation between a numerator and a denominator?

                        Thank you all very much.



                        • #13
                          I did not suggest using log(num)/log(den), which is malformed. I just pointed out that log(num/den) = log num - log den. I'd advise revision of your textbook or course notes on logarithms. So, Option 2 is a misreading and not a good idea.
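
                          A small worked example of the difference, using made-up values (numerator 100, denominator 1e6) rather than anything from the OP's data:

                          ```stata
                          display log(100/1e6)          // log of the ratio, about -9.21
                          display log(100) - log(1e6)   // difference of logs, the same value
                          display log(100)/log(1e6)     // ratio of logs, about 0.333: a different quantity
                          ```

                          The first two lines agree because log(a/b) = log a - log b; the third is 2 log(10) / (6 log(10)) = 1/3, which has no such interpretation.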



                          • #14
                            I see no problem in correlation of numerator and denominator. For example, consider a bunch of countries and numerator number of Stata users and denominator number of intelligent population. You'd expect a correlation; much of the point of a ratio is to adjust for it.

                            I mostly agree with Alfonso but I think you need a substantive reason to enter numerator and denominator separately as predictors.

                            Notice that we're in the dark here even on what this ratio is. The substance matters, as a ratio could be anything from a standard measure that people in your field are comfortable thinking about as a well-defined predictor to something more esoteric or ad hoc.



                            • #15
                              Nick Cox Thank you for your answer. I misunderstood your point and now find that the difference between the log numerator and the log denominator does equal the log of the ratio (num/den). Specifically, I revised lsem2 of #12 as follows.

                              gen lsem2=log(num)-log(den)

                              This variable is the same as lsem (log(num/den)). I also appreciate your kind explanation on my second question.
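
                              One way to confirm the match, as a sketch that assumes lsem and lsem2 both exist as defined in #12 and above. The tolerance of 1e-5 is an arbitrary choice to allow for rounding when variables are stored as float.

                              ```stata
                              * tolerance 1e-5 is an arbitrary allowance for float rounding
                              assert reldif(lsem, lsem2) < 1e-5 if !missing(lsem, lsem2)
                              ```

                              An exact test such as assert lsem == lsem2 may fail for reasons of storage precision alone, which is why a relative difference is used here.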
