
  • Do logs modify the correlation between two variables?

    Dear Statalists,

    I am applying logs to two variables:

    Code:
    gen In_Arg_X_Bra = ln(Arg_X_Bra + 1)
    gen In_Chn_X_LAC = ln(Chn_X_LAC + 1)
    And then running:

    Code:
    pwcorr In_Arg_X_Bra In_Chn_X_LAC, sig
    Before the -ln()- transformation the correlation between the two variables is 0.03, and afterwards it is 0.36 (both with p = 0.00). I need the two variables not to be correlated, since I am running an -ivregress gmm- regression in which Arg_X_Bra is the dependent variable and Chn_X_LAC is the instrumental variable (IV). That holds in levels, but not after transforming to logs, which makes my IV useless.

    How is this possible? Could it be related to the fact that I am using ln(X+1) to get around the impossibility of calculating ln(X) when X = 0? I have many zero values because of zero trade between countries. If so, I welcome any suggestions.
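
    For reference, a minimal sketch of the checks behind those numbers, assuming the raw variables are still in memory alongside the logged ones:

    Code:
    * how many zeros does ln(X + 1) have to absorb?
    count if Arg_X_Bra == 0
    count if Chn_X_LAC == 0

    * correlation in levels versus after the started logs, side by side
    pwcorr Arg_X_Bra Chn_X_LAC, sig
    pwcorr In_Arg_X_Bra In_Chn_X_LAC, sig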

    Thanks a lot,
    Giuliana.
    Last edited by Giuliana Moroni; 20 Feb 2019, 07:43.

  • #2
    Taking logarithms is not a linear transformation, so you would expect correlations to change. The only exception would be where the correlation had magnitude 1 in the first instance. Adding 1 before transformation doesn't make that effect worse or better.

    Drawing scatter plots of your data before and after transformation should surely illuminate what changes.
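
    For example, a minimal sketch using the variable names from #1 (the graph names are just illustrative):

    Code:
    * levels: with many zero trade flows, the zeros dominate the picture
    scatter Arg_X_Bra Chn_X_LAC, name(levels, replace)

    * after ln(x + 1): the zeros are mapped to 0 and sit inside the body of the data
    scatter In_Arg_X_Bra In_Chn_X_LAC, name(logs, replace)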



    • #3
      Giuliana, as a side note, I would remark on your use of "started logs," that is, adding 1 or some other small constant to a variable before logging it to handle 0 values. I'd discourage you from doing that. Some years ago I did this in an analysis, and a journal reviewer complained that the practice gives undependable results. S/he suggested trying different small constants (e.g., 0.01, 0.1, 1.0) and seeing whether the regression results varied; they did, by quite a bit. In that circumstance I ended up using a sqrt() transformation instead of ln(), which handled the 0 values just as well and had the happy side effect of noticeably improving the fit of my model.
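
      For instance, a minimal sketch of that sensitivity check, using the variable names from #1 and the constants mentioned above:

      Code:
      * try several started-log constants and see how much the results move
      local i = 0
      foreach c of numlist 0.01 0.1 1 {
          local ++i
          gen double lnA`i' = ln(Arg_X_Bra + `c')
          gen double lnC`i' = ln(Chn_X_LAC + `c')
          display as text "started-log constant = `c'"
          pwcorr lnA`i' lnC`i', sig
      }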



      • #4
        log(1 + x) is actually quite conservative. A good reason for being queasy about it is supposedly what it means for elasticity, but if you have zeros in your data then the relevance of elasticity as a governing concept seems moot in any case.

        I'd warn against a common misreading. Any intuition that log(x + smidgen) is close to log(x) holds only for large positive x. If x is zero and smidgen << 1, then log(x + smidgen) is large and negative and will create outliers out of the zeros! Otherwise put, for x at or near zero, log(1 + x) stays close to the rest of the transformed data, whereas log(smidgen + x) with smidgen << 1 does not.
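
        Two lines in Stata make the point (nothing here depends on the data):

        Code:
        display ln(0 + 0.000001)   // about -13.8: a zero becomes an extreme negative value
        display ln(0 + 1)          // exactly 0: a zero stays with the rest of the data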

        I have to wonder what is taught where about logarithms. Don't they feature in pre-calculus or equivalent courses in late high school or early college courses as background for anything quantitative?

        I'll also plug cube roots as helpful for quantities that can be positive, zero, or negative (e.g. profit/loss).
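
        A minimal sketch of one way to get a signed cube root in Stata (profit is a hypothetical variable name here):

        Code:
        * sign() keeps the sign, so negatives stay negative and 0 stays 0
        gen double cuberoot_profit = sign(profit) * abs(profit)^(1/3)
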
        Last edited by Nick Cox; 20 Feb 2019, 09:27.



        • #5
          A footnote to #2: there are some other cases where taking logarithms will not change a correlation. But with most data they will.



          • #6
            I believe Stata tip 96 by Nick Cox (in 119 Stata Tips, 3rd ed., edited by Nick Cox and H.J. Newton) discusses using cube roots (which are defined for zeros and negative numbers) instead of logs when there are zeros or negative numbers in the data. As Mike points out, square roots may also work if the problem is just zeros and not negative numbers.



            • #7
              Thank you very much, all. I agree that ln(X+1) is not the best option and that it biases my results. I have tried sqrt() and it worked much better, since my problem is just zeros (not negative numbers).

              However, I haven't seen a sqrt() transformation being used in the trade literature when dealing with zero trade values, especially in gravity models.

              For instance:
              Estimating gravity equations: to log or not to log?
              Fitting the Gravity Model when Zero Trade Flows are Frequent: a Comparison of Estimation Techniques
              They conclude that a Poisson pseudo-maximum likelihood (PPML) estimator might be more appropriate in this case. I'm not working with a gravity model, but all of my variables (-y x1 x2 x3 z-) represent trade flows.

              Also, sqrt() wouldn't allow me to present results as percent changes or multiplicative factors. I would appreciate hearing your opinion on this.



              • #8
                ln(1 + X) works quite well for many purposes. I wasn't criticising it, and I don't endorse your criticism. My point is that it is greatly preferable to, say, ln(1e-6 + X) or ln(1e-9 + X).

                I have not studied your links and am not well up on your literature, but I note that generalised linear models with log links don't transform the response and don't assume that data are all positive even when that link is being used.
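
                For illustration, a minimal sketch of that kind of model with the response left in levels, using the variable names from #7 (whether a Poisson family is the right choice for these trade flows is a separate modelling decision):

                Code:
                * log link, untransformed response; robust standard errors in the PPML spirit
                glm y x1 x2 x3, family(poisson) link(log) vce(robust)

                * the same fit via poisson directly
                poisson y x1 x2 x3, vce(robust)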



                • #9
                  Thanks a lot, Nick. I'll follow your advice.

