Correlation between two categorical variables

Jessica Berrett

Join Date: Sep 2019

Posts: 57
#1


22 Mar 2021, 11:03

I have two questions regarding the correlation between two categorical variables.

1) If using Cramer's V, I know that you can interpret the strength and direction of the relationship. However, how do you interpret, say, a positive relationship between two categorical variables that are both yes/no or 1/0 variables?

2) I just want to confirm: if the two categorical variables have two levels each (yes/no or 1/0), is Cramer's V appropriate, or should I be using a different test?

Thank you!
Bruce Weaver

Join Date: May 2014

Posts: 1119
#2

22 Mar 2021, 11:23

There are many measures of association one could use for a 2x2 table. Here are some common ones:
- Risk ratio
- Odds ratio
- Risk difference
- Phi coefficient (i.e., Pearson r computed on two dichotomies)

Which one(s) you choose will likely depend on the context: the discipline, and whether one variable is an outcome and the other explanatory (as opposed to simple association between variables with no such clear roles).
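To make the four measures concrete, here is a minimal sketch in plain Python (not Stata output). The function name, the table layout, and the counts are all made up for illustration, and the code assumes no zero cells or zero margins.

```python
from math import sqrt


def two_by_two_measures(a, b, c, d):
    """Association measures for a 2x2 table of counts.

    Assumed layout (hypothetical, for illustration only):
                    outcome=1   outcome=0
        exposure=1      a           b
        exposure=0      c           d
    """
    risk_exposed = a / (a + b)      # P(outcome=1 | exposure=1)
    risk_unexposed = c / (c + d)    # P(outcome=1 | exposure=0)
    # Phi is the Pearson correlation of the two 0/1 variables.
    phi = (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return {
        "risk_ratio": risk_exposed / risk_unexposed,
        "odds_ratio": (a * d) / (b * c),
        "risk_difference": risk_exposed - risk_unexposed,
        "phi": phi,
    }


# Made-up counts: 30 of 100 exposed and 10 of 100 unexposed have the outcome.
measures = two_by_two_measures(30, 70, 10, 90)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

The first three measures treat one variable as the outcome, so they change if the roles of rows and columns are swapped; phi is symmetric in the two variables.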

HTH.

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 18.5 (Windows)
Nick Cox

Join Date: Mar 2014

Posts: 35432
#3

22 Mar 2021, 11:24

Cramér's V is a measure of association corresponding to a chi-square test. If you want a test, use the chi-square test or Fisher's exact test. The orthodox position seems to be that Fisher's test is more focused on the specific problem, but I've seen push-back against that. A different kind of problem is that the chi-square test is always easily computable, but that's not necessarily true of Fisher's exact test.

If you have two binary variables, the sign of any relationship just depends on conventions about which state is coded 0 and which 1. There is a grey area between a convention being natural and it being familiar. If anything is even a smidgen towards being causal, it seems usual to code both binaries to yield positive association. So being a smoker and getting lung cancer would both be coded 1, and their opposites 0, but I guess associating with other code choices would at worst be thought awkward rather than wrong. And there are plenty of negative associations too.
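A small sketch of both points, in plain Python with made-up counts (the function names are hypothetical, not from any package). Cramér's V is computed as sqrt(chi-square / (n * min(r - 1, c - 1))), so it is never negative. The phi coefficient, which is signed, flips sign when one binary variable is recoded.

```python
from math import sqrt


def chi_square(table):
    """Pearson chi-square statistic for an r x c table of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def cramers_v(table):
    """Cramer's V: a strength-only measure, always between 0 and 1."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    return sqrt(chi_square(table) / (n * k))


def phi(table):
    """Signed phi coefficient; defined for 2x2 tables only."""
    (a, b), (c, d) = table
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))


original = [[30, 70], [10, 90]]        # made-up counts
recoded = [original[1], original[0]]   # swap which row state is coded 1

print(phi(original), cramers_v(original))
print(phi(recoded), cramers_v(recoded))
```

For a 2x2 table V equals the absolute value of phi, so V alone cannot say anything about direction. The same `cramers_v` function accepts larger tables, such as 2x4.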
Jessica Berrett

Join Date: Sep 2019

Posts: 57
#4

31 Mar 2021, 08:58

I so appreciate all of your responses; this is very helpful! As a follow-up question, could I use Cramer's V if I have one variable that has two categories and another variable that has four categories, i.e. a 2x4 table? If not, what test of association would be appropriate?
Nick Cox

Join Date: Mar 2014

Posts: 35432
#5

31 Mar 2021, 09:02

Again, as a test of association chi-square and Fisher's test could both be used. If the 4-category variable is ordered, there are more tests on offer.
Jessica Berrett

Join Date: Sep 2019

Posts: 57
#6

31 Mar 2021, 10:49

Yes, I understand the chi-square and Fisher's test can be used. However, my student is wanting to follow up on that to assess the strength and direction of the relationship.
Bruce Weaver

Join Date: May 2014

Posts: 1119
#7

31 Mar 2021, 11:47

Here are some notes that may be helpful.
https://polisci.usca.edu/apls301/Tex...ssociation.htm

https://blog.zenggyu.com/en/post/201...-to-use-which/

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 18.5 (Windows)
Nick Cox

Join Date: Mar 2014

Posts: 35432
#8

31 Mar 2021, 12:00

Seems to me that your resources include non-parametric methods books and introductory categorical data analysis books. You may have to shop around to find which is most congenial.
Mike Lacy

Join Date: Apr 2014

Posts: 2404
#9

31 Mar 2021, 14:01

The mention of "direction" here would imply that the 4-category variable is ordered, and 2 categories are always ordered. So if one of these variables is regarded as explanatory and the other as response, I would strongly recommend Somers' D, about which see -ssc describe somersd-. D is a measure of association for which a test and CI exist. If there is no such explanatory/response distinction, I'd use Goodman and Kruskal's gamma, -tabulate Y X, gamma-, which is the original statistic on which D was based.

If both variables are nominal, and the explanatory/response distinction holds, I'd strongly recommend Goodman and Kruskal's tau, which is known (but not well) to be an explained variation measure based on Simpson's measure of nominal variation. There's no Stata program for G & K's tau, but it's only mildly tedious to calculate by hand, and can be aided by -ssc entropyetc-.
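For anyone who wants to see the arithmetic, here is a rough sketch of the three statistics in plain Python. It is not the -somersd- or -tabulate- implementation; the function names and counts are made up. It assumes rows are the explanatory variable, columns are the response, categories are listed from low to high, and at least one pair is concordant or discordant.

```python
def concordant_discordant(table):
    """Count concordant and discordant pairs in an ordered r x c table."""
    n_rows, n_cols = len(table), len(table[0])
    conc = disc = 0
    for i in range(n_rows):
        for k in range(i + 1, n_rows):        # second unit is higher on rows
            for j in range(n_cols):
                for m in range(n_cols):
                    pairs = table[i][j] * table[k][m]
                    if m > j:                 # also higher on columns
                        conc += pairs
                    elif m < j:               # lower on columns
                        disc += pairs
    return conc, disc


def gk_gamma(table):
    """Goodman and Kruskal's gamma (symmetric, ignores ties)."""
    conc, disc = concordant_discordant(table)
    return (conc - disc) / (conc + disc)


def somers_d(table):
    """Somers' D of columns (response) given rows (explanatory)."""
    conc, disc = concordant_discordant(table)
    n = sum(sum(row) for row in table)
    # Pairs that differ on the row variable, whether or not tied on columns.
    untied_on_rows = (n * n - sum(sum(row) ** 2 for row in table)) // 2
    return (conc - disc) / untied_on_rows


def gk_tau(table):
    """Goodman and Kruskal's tau for columns given rows (nominal variables)."""
    n = sum(sum(row) for row in table)
    col_totals = [sum(col) for col in zip(*table)]
    within = sum(cell ** 2 / sum(row) for row in table for cell in row)
    marginal = sum(total ** 2 for total in col_totals) / n
    return (within - marginal) / (n - marginal)


# Made-up 2x4 table: 2-level explanatory rows, 4 ordered response columns.
ordered = [[10, 20, 30, 40], [40, 30, 20, 10]]
print(gk_gamma(ordered), somers_d(ordered), gk_tau(ordered))
```

Gamma and D carry a sign, which depends on how the categories are ordered; tau does not, because it measures explained variation. On a 2x2 table D reduces to the risk difference and tau to phi squared, which is a handy check on hand calculations.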

The locus classicus is:
Goodman, L. A., and W. H. Kruskal. 1954. Measures of association for cross classifications. Journal of the American Statistical Association 49: 732–764.
Sociological statistics texts in the 1960s and 70s commonly treated the G & K statistics.

Finally, I'd note that there is also an ordinal X nominal measure of association, similar in concept to G & K's tau, about which see the article cited in my -r2o- package, -ssc describe r2o-.