Correlation between binary and continuous variables

Lijuan Xie

Join Date: Nov 2019

Posts: 8
#1


08 Feb 2020, 12:51

Hello everyone!

I have a bunch of binary variables, such as GENDER, and some continuous variables, such as INCOME. Can I still use "pwcorr" in Stata to look at the correlations between them? More generally, is it reasonable to compute a correlation between variables of these different types? Thank you!

BR
Lijuan

Clyde Schechter

Join Date: Apr 2014

Posts: 29953
#2

08 Feb 2020, 13:22

You can use -pwcorr- to calculate correlations between dichotomous or ordinal variables and continuous variables. The question is really whether you want to or not. -pwcorr- calculates the Pearson correlation coefficient, which has the advantage of being familiar to almost everybody who has taken an introductory statistics course, and even to a lot of people who haven't.

When used to correlate a dichotomous variable and a continuous variable, it is actually equivalent to doing an (equal-variance) two-sample t-test on that continuous variable over the dichotomous one. (The t-statistic, up to its sign, and the p-value will come out the same.)
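
For anyone who wants to check this equivalence, here is a minimal sketch. It assumes Stata's bundled auto dataset, with price standing in for the continuous variable and foreign for the dichotomous one; those choices, and the scalar names, are only for illustration. It compares the t-statistic from -ttest- with the one implied by the correlation, t = r*sqrt((n-2)/(1-r^2)).

```stata
* Illustrative sketch: auto.dta, price and foreign are assumed stand-ins
* for your own continuous and 0/1 variables.
sysuse auto, clear

* Pearson correlation of continuous price with binary foreign, plus p-value
pwcorr price foreign, sig obs

* Equal-variance two-sample t-test of price across the two groups
ttest price, by(foreign)
scalar t_ttest = r(t)
scalar p_ttest = r(p)

* t-statistic and two-sided p-value implied by the correlation coefficient
quietly correlate price foreign
scalar rho_pb = r(rho)
scalar n_obs  = r(N)
scalar t_corr = rho_pb*sqrt((n_obs - 2)/(1 - rho_pb^2))
scalar p_corr = 2*ttail(n_obs - 2, abs(t_corr))

* The two t-statistics should agree in magnitude; the sign depends on
* which group -ttest- subtracts from which.
display "t from ttest: " t_ttest "   t from correlation: " t_corr
display "p from ttest: " p_ttest "   p from correlation: " p_corr
```

The match holds for the default -ttest-, which assumes equal variances; with the unequal option the t-test p-value will generally differ from the one -pwcorr- reports.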

When used to correlate a polytomous ordinal variable with another ordinal variable or a continuous variable, it is harder to give a simple or natural interpretation, but one still has the sense that it is telling you in some way the extent to which larger values of one variable are associated with larger values of the other. So you can certainly use it in that way. Interpreting the corresponding t-statistics or p-values is a bit dicier as it isn't entirely clear what the null hypothesis or an alternative hypothesis might mean.

If you want to do something a bit fancier, and if your audience is sophisticated enough to grasp it, you can use biserial, polyserial or polychoric correlations instead. I don't think there is a native Stata command to calculate these, but Sergiy Radyakin wrote a program -polychoric-, which, I believe, is still available on SSC. The concept is to treat the ordinal or dichotomous variable as being a discrete observed counterpart to a continuous, normally distributed latent variable, and to estimate the Pearson correlation between that latent variable and the continuous variable. This is clearly a bit more complicated than an ordinary Pearson correlation, and probably will baffle audiences without statistical training. It also has a few drawbacks: the estimation is an iterative process that sometimes fails to converge; there is no simple phrase that explains what it is to the uninitiated; and there are contexts in which the notion that a dichotomous variable is a discrete observable driven by a normally distributed latent variable doesn't seem right, e.g. male vs female, or foreign vs domestic.

Personally, I don't use these more complicated correlations much: primarily I use them when working with people who really understand them and ask to see them. But you can consider them, depending on your circumstances.

Correction: I misattributed -polychoric- to Sergiy Radyakin. It was actually developed by Stas Kolenikov. And it is not available on SSC: you can get it from http://staskolenikov.net/stata. Sorry for the confusion.

Last edited by Clyde Schechter; 08 Feb 2020, 13:25.
Lijuan Xie

Join Date: Nov 2019

Posts: 8
#3

08 Feb 2020, 13:38

Dear Clyde,

Thank you so much for such a quick and detailed response! It gives me a lot of information and inspiration. I really appreciate it!

Best regards
Lijuan