correlation vs. logistic regression

Nina Lasek

Join Date: Mar 2017

Posts: 2
#1

30 May 2018, 12:26

Using Stata, I am analysing a data set with a dichotomous independent and a dichotomous dependent variable. Pearson's r is only r = 0.047. However, if I use the same variables in a logistic regression, the logit coefficient is 1.123, which gives an exp(b) of 3.074. So, based on the bivariate correlation one might conclude the relationship is very small, while based on the bivariate logistic regression one might conclude the relationship between the variables is substantial. Is this a known phenomenon? One characteristic of the data is that in both the dependent and the independent variable the categories are very uneven: 98% zeros, 2% ones. Can anybody help me understand what might be going on?

Thank you very much,
Nina
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

30 May 2018, 12:43

This is not surprising, and it is definitely related to the imbalanced distributions. If you start out with a probability of approximately 98% for y = 1 when x = 0, that corresponds to odds of .98/.02 = 49. If the odds ratio associated with x = 1 vs. 0 is 3.074, then the odds of y = 1 when x = 1 are 49 × 3.074 = 150.626. Converting that to a probability, you get 150.626/151.626 ≈ 0.993. Notice how the huge odds ratio corresponds to a minuscule change in the probability (0.98 to 0.993) when the base probability is close to 1. The correlation, however, does not include this magnification effect.
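For anyone who wants to replay that arithmetic outside Stata, here is a minimal sketch in Python. The function names are made up for illustration, and the 0.50 baseline is an added comparison point that does not come from the data in this thread.

```python
def prob_to_odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)


def odds_to_prob(odds):
    """Convert odds back to a probability."""
    return odds / (1.0 + odds)


def shifted_prob(base_prob, odds_ratio):
    """Probability after multiplying the baseline odds by an odds ratio."""
    return odds_to_prob(prob_to_odds(base_prob) * odds_ratio)


ODDS_RATIO = 3.074  # exp(b) reported in post #1

# Baseline used in the post above: P(y = 1 | x = 0) is about 0.98.
base_odds = prob_to_odds(0.98)       # about 49
new_odds = base_odds * ODDS_RATIO    # about 150.6
new_prob = odds_to_prob(new_odds)    # about 0.993

# Assumed comparison point: the same odds ratio applied at a 0.50 baseline.
mid_prob = shifted_prob(0.50, ODDS_RATIO)  # about 0.755

print(f"baseline 0.98 -> {new_prob:.4f}")
print(f"baseline 0.50 -> {mid_prob:.4f}")
```

The same odds ratio moves a 0.50 baseline by roughly 25 percentage points but a 0.98 baseline by only about 1.3, which is the magnification effect described above.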

To convince yourself you're doing everything right here, skip the fancy logistic regression and just run

Code:

tab x y

and calculate the odds ratio directly from the table.
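As a rough illustration of that hand calculation, the cross-product formula can be sketched in Python. The function name is hypothetical, and the counts are the illustrative frequencies posted in #4 of this thread, which are not necessarily the original poster's data.

```python
import math


def odds_ratio_2x2(n00, n01, n10, n11):
    """Cross-product odds ratio for a 2x2 table.

    n_xy is the cell count with x as the first index and y as the second,
    so n01 is the number of observations with x = 0 and y = 1.
    """
    return (n00 * n11) / (n01 * n10)


# Illustrative counts from post #4 (an assumption, not the original data):
# x=0,y=0: 2   x=0,y=1: 98   x=1,y=0: 5   x=1,y=1: 753
or_hat = odds_ratio_2x2(2, 98, 5, 753)
log_or = math.log(or_hat)  # the logit coefficient in a one-predictor model

print(f"odds ratio = {or_hat:.3f}, log odds ratio = {log_or:.3f}")
```

With these counts the result is close to the exp(b) of 3.074 and the logit coefficient of 1.123 reported in post #1.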
Amin Sofla

Join Date: May 2018

Posts: 67
#3

30 May 2018, 12:44

Duplicate response:
I will leave the comprehensive answers to others.
In simple words, I think that Pearson's r measures the linear association between the variables, while the logit coefficient describes a non-linear relationship.
Dirk Enzmann

Join Date: Apr 2014

Posts: 537
#4

31 May 2018, 05:34

If you can assume a latent bivariate normal distribution for your variables (that is, your observed variables result from dichotomizing the latent variables at their respective thresholds), you can calculate tetrachoric correlation coefficients, which in your case will be substantially higher than Pearson correlations:

Code:

clear
input y x f
0 0 2
1 0 98
0 1 5
1 1 753
end
tab2 y x [fw=f], V
logistic y x [fw = f]
tetrachoric y x [fw=f]

A good reading explaining why your Pearson correlation is low if the thresholds of both variables are extreme is: MacCallum, R. C., Zhang, S., Preacher, K. J., & Rucker, D. D. (2002). On the practice of dichotomization of quantitative variables. Psychological Methods, 7(1), 19-40.
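To see by hand why the Pearson correlation comes out so small for the frequency table above, here is a minimal Python sketch of the phi coefficient, which is what Pearson's r reduces to for two 0/1 variables. The function name is hypothetical, and the tetrachoric correlation is not reproduced here because it needs a bivariate normal fit.

```python
import math


def phi_coefficient(n00, n01, n10, n11):
    """Phi coefficient (Pearson's r for two binary variables).

    n_xy is the cell count with x as the first index and y as the second.
    """
    x0_total = n00 + n01  # all observations with x = 0
    x1_total = n10 + n11  # all observations with x = 1
    y0_total = n00 + n10  # all observations with y = 0
    y1_total = n01 + n11  # all observations with y = 1
    numerator = n00 * n11 - n01 * n10
    denominator = math.sqrt(x0_total * x1_total * y0_total * y1_total)
    return numerator / denominator


# Frequencies from the table above:
# x=0,y=0: 2   x=0,y=1: 98   x=1,y=0: 5   x=1,y=1: 753
phi = phi_coefficient(2, 98, 5, 753)
print(f"phi = {phi:.3f}")  # about 0.048
```

The denominator grows with the product of all four marginal totals while the numerator depends on the few off-pattern cases, so extreme splits on both variables keep phi small even when the odds ratio is around 3.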
Amin Sofla

Join Date: May 2018

Posts: 67
#5

31 May 2018, 10:39

Dirk Enzmann: Thanks for the informative response.