Chi square test or ANOVA?

Svetlana Bondar

Join Date: Nov 2019
Posts: 30

15 Dec 2019, 13:44

Hi! I have two variables: doctors' specialty and their smoking status (smoker or non-smoker). I want to find out whether doctors of different specialties differ in smoking status. Which should I use in this case: a chi-square test or ANOVA? Please help! I am confused by this simple question.

Code:

[CODE]
* Example generated by -dataex-. To install: ssc install dataex
clear
input long(speciality smoking_status)
1 2
1 2
1 1
1 1
2 1
2 1
2 2
2 2
3 1
3 1
3 1
3 1
4 2
4 2
4 2
4 2
5 1
5 2
5 1
5 1
end
label values speciality speciality
label def speciality 1 "cardiologist", modify
label def speciality 2 "endocrinologist", modify
label def speciality 3 "internal_medicine", modify
label def speciality 4 "surgery", modify
label def speciality 5 "traumatology", modify
label values smoking_status smoking_status
label def smoking_status 1 "no", modify
label def smoking_status 2 "yes", modify
[/CODE]

Clyde Schechter

Join Date: Apr 2014

Posts: 29910
#2

15 Dec 2019, 13:55

Code:

tab speciality smoking_status, chi2

You have a polytomous variable as your "exposure" and a dichotomous variable as your "outcome", so this is a classic situation for a chi-square test. ANOVA is really meant for continuous outcomes. You can do this with ANOVA, and the resulting p-value will not be all that much different from what you get with chi-square, but in general the two will not be exactly the same. Neither is perfect. With a dichotomous outcome variable, the homoscedasticity assumption of ANOVA is not met. And the chi-square test is, in any case, an approximation, though a good one unless you have some very small cells, in which case the Fisher "exact" test might be better.

In your example data the overall sample size is very small and there are some zero cells (all of your surgeons are smokers and none of your internists are). The presence of zero cells like that degrades the quality of both the ANOVA and the chi-square approaches. But I'm imagining that in your full data set you won't have this problem.
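For anyone who wants to try the options mentioned above side by side, here is a minimal sketch on the toy data from post #1. The variable name smoker is a hypothetical 0/1 recode introduced only for this illustration; it is not part of the original data. The -expected- option prints expected cell counts so small cells are easy to spot, and -exact- adds Fisher's exact test.

```stata
* Toy data from post #1 (speciality 1-5; smoking_status 1 = no, 2 = yes)
clear
input long(speciality smoking_status)
1 2
1 2
1 1
1 1
2 1
2 1
2 2
2 2
3 1
3 1
3 1
3 1
4 2
4 2
4 2
4 2
5 1
5 2
5 1
5 1
end

* Pearson chi-square with expected counts, plus Fisher's exact test
* (the exact test is cheap here because the table is tiny)
tabulate speciality smoking_status, chi2 expected exact

* One-way ANOVA on a 0/1 recode, shown only for comparison;
* "smoker" is a hypothetical variable name
generate byte smoker = smoking_status == 2 if !missing(smoking_status)
anova smoker speciality
```

In this toy table every specialty has 4 doctors, with 11 non-smokers and 9 smokers overall, so each expected count is 2.2 or 1.8. All are well below the usual rule-of-thumb threshold of 5, which is why the chi-square approximation is shaky here.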

Last edited by Clyde Schechter; 15 Dec 2019, 13:58.
Svetlana Bondar

Join Date: Nov 2019

Posts: 30
#3

15 Dec 2019, 14:09

Indeed, my complete database has 100,000 doctors. But for some specialties there are still zero cells. How can I fix this problem and improve the analysis?

Last edited by Svetlana Bondar; 15 Dec 2019, 14:20.
Clyde Schechter

Join Date: Apr 2014

Posts: 29910
#4

15 Dec 2019, 14:55

How many different specialties are there, and how many of them are either all smokers or all non-smokers?
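One way to get those counts in Stata is sketched below on the toy data from post #1. The names all_same and one_per_spec are hypothetical helper variables introduced for this illustration, and the same commands should carry over to the full data set if its variables have the same names.

```stata
* Toy data from post #1 (speciality 1-5; smoking_status 1 = no, 2 = yes)
clear
input long(speciality smoking_status)
1 2
1 2
1 1
1 1
2 1
2 1
2 2
2 2
3 1
3 1
3 1
3 1
4 2
4 2
4 2
4 2
5 1
5 2
5 1
5 1
end

* Within each specialty, sort by smoking status; if the first and last
* values match, everyone in that specialty has the same status
bysort speciality (smoking_status): generate byte all_same = smoking_status[1] == smoking_status[_N]

* Mark one observation per specialty so each specialty is counted once
egen byte one_per_spec = tag(speciality)

* Which specialties are homogeneous, and in which direction
tabulate speciality smoking_status if one_per_spec & all_same

* Number of distinct specialties, then number that are homogeneous
count if one_per_spec
count if one_per_spec & all_same
```

After the last command, r(N) holds the number of specialties that are either all smokers or all non-smokers.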
Svetlana Bondar

Join Date: Nov 2019

Posts: 30
#5

15 Dec 2019, 15:46

Originally posted by Clyde Schechter:

How many different specialties are there, and how many of them are either all smokers or all non-smokers?

There are 27 different specialties; in 10 of them all the doctors are non-smokers. I used a chi-square test with Bonferroni correction to compare them.

Last edited by Svetlana Bondar; 15 Dec 2019, 15:52.
Clyde Schechter

Join Date: Apr 2014

Posts: 29910
#6

15 Dec 2019, 16:02

Well, that's a lot of zero cells, even with 27 different specialties. I suppose it's not surprising--there aren't a lot of physicians who smoke. I might use the Fisher exact test here, though with a 27x2 table it might run out of memory or just take too long.
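A rough sketch of two things one could try if the exact test struggles, shown on the toy data from post #1. The multiplier 2, the seed, and the 1,000 replications are arbitrary choices for illustration. The second command is a Monte Carlo permutation test on the Pearson chi-square statistic; it is a substitute for Fisher's test rather than the same test, so treat it as an assumption-laden alternative.

```stata
* Toy data from post #1 (speciality 1-5; smoking_status 1 = no, 2 = yes)
clear
input long(speciality smoking_status)
1 2
1 2
1 1
1 1
2 1
2 1
2 2
2 2
3 1
3 1
3 1
3 1
4 2
4 2
4 2
4 2
5 1
5 2
5 1
5 1
end

* Fisher's exact test with a larger memory allowance: the number in
* exact(#) is a multiplier on the memory -tabulate- is allowed to use
tabulate speciality smoking_status, exact(2)

* Monte Carlo alternative: shuffle smoking_status across doctors and
* recompute the Pearson chi-square on each shuffled data set.
* Because chi-square is never negative, the upper-tail p-value is the
* relevant one when reading the -permute- output.
set seed 12345
permute smoking_status chi2 = r(chi2), reps(1000): tabulate speciality smoking_status, chi2
```

The permutation approach keeps the row and column totals fixed, as the exact test does, but only samples from the possible tables instead of enumerating them, so its cost grows with the number of replications rather than with the size of the table.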

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17653

16 Dec 2019, 00:18

Svetlana:
as an aside to Clyde's (as always) towering insight, have you considered using -logit- (or -logistic-)?

Code:

clear
input long(speciality smoking_status)
1 2
1 2
1 1
1 1
2 1
2 1
2 2
2 2
3 1
3 1
3 1
3 1
4 2
4 2
4 2
4 2
5 1
5 2
5 1
5 1
end
label values speciality speciality
label def speciality 1 "cardiologist", modify
label def speciality 2 "endocrinologist", modify
label def speciality 3 "internal_medicine", modify
label def speciality 4 "surgery", modify
label def speciality 5 "traumatology", modify

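* Recode the 1/2 smoking indicator to the 0/1 coding that -logit- expects
g smoking_status_new=0 if smoking_status==1
replace smoking_status_new=1 if smoking_status==2
* Specialties with no smokers or only smokers are dropped (see the notes in the output)
logit smoking_status_new i.speciality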

. logit smoking_status_new i.speciality

note: 3.speciality != 0 predicts failure perfectly
      3.speciality dropped and 4 obs not used

note: 4.speciality != 0 predicts success perfectly
      4.speciality dropped and 4 obs not used

Iteration 0:   log likelihood = -8.1503192 
Iteration 1:   log likelihood = -7.7967769 
Iteration 2:   log likelihood = -7.7945187 
Iteration 3:   log likelihood =  -7.794518 

Logistic regression                             Number of obs     =         12
                                                LR chi2(2)        =       0.71
                                                Prob > chi2       =     0.7006
Log likelihood =  -7.794518                     Pseudo R2         =     0.0437

------------------------------------------------------------------------------------
smoking_status_new |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------------+----------------------------------------------------------------
        speciality |
  endocrinologist  |   1.56e-17   1.414214     0.00   1.000    -2.771808    2.771808
internal_medicine  |          0  (empty)
          surgery  |          0  (empty)
     traumatology  |  -1.098612   1.527525    -0.72   0.472    -4.092506    1.895282
                   |
             _cons |   6.96e-17          1     0.00   1.000    -1.959964    1.959964
------------------------------------------------------------------------------------

.

Kind regards,
Carlo
(StataNow 18.5)


Svetlana Bondar

Join Date: Nov 2019

Posts: 30
#8

17 Dec 2019, 15:03

Originally posted by Clyde Schechter:

Well, that's a lot of zero cells, even with 27 different specialties. I suppose it's not surprising--there aren't a lot of physicians who smoke. I might use the Fisher exact test here, though with a 27x2 table it might run out of memory or just take too long.

Thank you very much! Your answer really helped me!