  • correlation vs. logistic regression


    Using Stata, I am analysing a data set with a dichotomous independent and a dichotomous dependent variable. Pearson's r is only r = 0.047. However, if I use the same variables in a logistic regression, the logit coefficient is 1.123, which gives an exp(b) of 3.074. So, based on the bivariate correlation one might conclude that the relationship is very small, whereas based on the bivariate logistic regression one might conclude that the relationship between the variables is substantial. Is this a known phenomenon? One characteristic of the data is that in both the dependent and the independent variable the categories are very uneven: 98% zeros, 2% ones. Can anybody help me understand what might be going on? Thank you very much, Nina



  • #2
    This is not surprising, and it is definitely related to the imbalanced distributions. If you start out with a probability of approximately 98% for y = 1 when x = 0, that corresponds to odds of .98/.02 = 49. If the odds ratio associated with x = 1 vs. 0 is 3.074, then the odds of y = 1 when x = 1 are 49 × 3.074 = 150.626. Converting that back to a probability, you get 150.626/151.626 ≈ 0.993. Notice how the large odds ratio corresponds to a minuscule change in the probability (from 0.98 to 0.993) when the base probability is close to 1. The correlation, however, does not include this magnification effect.
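
    As a rough sketch, the arithmetic above can be reproduced with a few scalars. The scalar names (p0, oddsratio, and so on) are only illustrative, and the 0.98 baseline is the approximate figure assumed in this post, not something read off the actual data.

    ```stata
    * Baseline probability and odds ratio as assumed in the post above
    scalar p0 = 0.98
    scalar oddsratio = 3.074

    * Probability -> odds, apply the odds ratio, then odds -> probability
    scalar odds0 = p0/(1 - p0)
    scalar odds1 = odds0*oddsratio
    scalar p1 = odds1/(1 + odds1)
    scalar diff = p1 - p0

    display "odds of y = 1 at x = 0: " %8.3f odds0
    display "odds of y = 1 at x = 1: " %8.3f odds1
    display "P(y = 1 | x = 1):       " %8.4f p1
    display "change in probability:  " %8.4f diff
    ```

    The change in probability should come out at a little over one percentage point, even though the odds roughly triple.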

    To convince yourself you're doing everything right here, skip the fancy logistic regression and just run

    Code:
    tab x y
    and calculate the odds ratio directly from the table.



    • #3
      Duplicate response:
      I will leave the comprehensive answers to others.
      In simple terms, I think that Pearson's r measures the linear association between the variables, while the logit coefficient describes a non-linear relationship.



      • #4
        If you can assume a latent bivariate normal distribution for your variables (that is, your observed variables result from dichotomizing the latent variables at their respective thresholds), you can calculate tetrachoric correlation coefficients, which in your case will be substantially higher than the Pearson correlations:
        Code:
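        * Cell frequencies f for each combination of y and x
        clear
        input y x f
        0 0 2
        1 0 98
        0 1 5
        1 1 753
        end

        * Cross-tabulation with Cramér's V
        tab2 y x [fw=f], V
        * Odds ratio from a logistic regression
        logistic y x [fw = f]
        * Tetrachoric correlation
        tetrachoric y x [fw=f]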
        A good reading explaining why your Pearson correlation is low if the thresholds of both variables are extreme is: MacCallum, R. C., Zhang, S., Preacher, K. J., & Rucker, D. D. (2002). On the practice of dichotomization of quantitative variables. Psychological Methods, 7(1), 19-40.
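
        The cell counts in the example above appear to reproduce the figures discussed earlier in the thread, which can be checked by hand. The following is only a sketch that reuses those same counts: it computes the odds ratio from the cross-product of the cells, the Pearson correlation of the two 0/1 variables (the phi coefficient), and the row percentages.

        ```stata
        * Same cell frequencies as in the example above
        clear
        input y x f
        0 0 2
        1 0 98
        0 1 5
        1 1 753
        end

        * Odds ratio as the cross-product ratio of the four cells
        display "odds ratio = " %6.3f (2*753)/(98*5)

        * Pearson correlation of two 0/1 variables (phi coefficient)
        corr y x [fw=f]

        * Share of y = 1 within each level of x
        tabulate x y [fw=f], row
        ```

        With these counts the odds ratio should be about 3.07 and the Pearson correlation just under 0.05, while the share of y = 1 moves only from 98% to roughly 99.3% between x = 0 and x = 1.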



        • #5
          Dirk Enzmann : Thanks for the informative response.

