  • Chi square test or ANOVA?

    Hi! I have two variables: doctors' specialty and their smoking status (smoker or non-smoker). I want to find out whether doctors of different specialties differ in smoking status. Which should I use in this case: a chi-square test or ANOVA? Please help! I am confused by this simple question.
    HTML Code:
    [CODE]
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long(speciality smoking_status)
    1 2
    1 2
    1 1
    1 1
    2 1
    2 1
    2 2
    2 2
    3 1
    3 1
    3 1
    3 1
    4 2
    4 2
    4 2
    4 2
    5 1
    5 2
    5 1
    5 1
    end
    label values speciality speciality
    label def speciality 1 "cardiologist", modify
    label def speciality 2 "endocrinologist", modify
    label def speciality 3 "internal_medicine", modify
    label def speciality 4 "surgery", modify
    label def speciality 5 "traumatology", modify
    label values smoking_status smoking_status
    label def smoking_status 1 "no", modify
    label def smoking_status 2 "yes", modify
    [/CODE]

  • #2
    Code:
    tab speciality smoking_status, chi2
    You have a polytomous variable as your "exposure" and a dichotomous variable as your "outcome", so this is a classic situation for a chi-square test. ANOVA is really meant to be used with continuous outcomes. You can do this with ANOVA, and the resulting p-value will not be all that different from what you get with chi-square, but in general they will not be exactly the same. Neither is perfect. With a dichotomous outcome variable, the homoscedasticity assumption for ANOVA is not met. And the chi-square test is, in any case, an approximation, though a good one unless you have some very small cells, in which case the Fisher "exact" test might be better. In your example data the overall sample size is very small and there are some zero cells (all your surgeons are smokers and none of your internists are). The presence of zero cells like that degrades the quality of both the ANOVA and the chi-square approaches. But I'm imagining that in your full data set you won't have this problem.
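
    As a minimal sketch of the comparison described above, the following rebuilds the example table from post #1 as cell counts and runs the chi-square test, the Fisher exact test, and, for comparison only, ANOVA on a 0/1 version of the outcome. The variables n and smoker are introduced here for illustration and are not in the original data.

    ```stata
    * Example data from post #1, entered as cell counts (zero cells omitted).
    * n is a helper variable introduced for this sketch.
    clear
    input speciality smoking_status n
    1 1 2
    1 2 2
    2 1 2
    2 2 2
    3 1 4
    4 2 4
    5 1 3
    5 2 1
    end
    expand n      // one row per doctor, 20 rows in total
    drop n

    * Pearson chi-square, Fisher's exact test, and expected cell counts
    tabulate speciality smoking_status, chi2 exact expected

    * ANOVA on a 0/1 outcome, shown only for comparison with the tests above
    generate byte smoker = (smoking_status == 2)
    anova smoker speciality
    ```

    The expected option prints the expected count in each cell, which is a quick way to see how small the cells are before deciding between the chi-square approximation and the exact test.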
    Last edited by Clyde Schechter; 15 Dec 2019, 14:58.



    • #3
      Originally posted by Clyde Schechter
      Code:
      tab speciality smoking_status, chi2
      You have a polytomous variable as your "exposure" and a dichotomous variable as your "outcome", so this is a classic situation for a chi-square test. ANOVA is really meant to be used with continuous outcomes. You can do this with ANOVA, and the resulting p-value will not be all that different from what you get with chi-square, but in general they will not be exactly the same. Neither is perfect. With a dichotomous outcome variable, the homoscedasticity assumption for ANOVA is not met. And the chi-square test is, in any case, an approximation, though a good one unless you have some very small cells, in which case the Fisher "exact" test might be better. In your example data the overall sample size is very small and there are some zero cells (all your surgeons are smokers and none of your internists are). The presence of zero cells like that degrades the quality of both the ANOVA and the chi-square approaches. But I'm imagining that in your full data set you won't have this problem.
      Indeed, my complete database has 100,000 doctors. But for some specialties there are zero cells. How can I fix this problem and improve the analysis?
      Last edited by Svetlana Bondar; 15 Dec 2019, 15:20.



      • #4
        How many different specialties are there, and how many of them are either all smokers or all non-smokers?



        • #5
          Originally posted by Clyde Schechter
          How many different specialties are there, and how many of them are either all smokers or all non-smokers?
          There are 27 different specialties, and in 10 of them all the doctors are non-smokers. I used the chi-square test with a Bonferroni correction to compare them.
          Last edited by Svetlana Bondar; 15 Dec 2019, 16:52.



          • #6
            Well, that's a lot of zero cells, even with 27 different specialties. I suppose it's not surprising--there aren't a lot of physicians who smoke. I might use the Fisher exact test here, though with a 27x2 table it might run out of memory or just take too long.
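
            If the exact test does prove too slow or memory-hungry on the full table, one alternative that is not suggested in this thread is a Monte Carlo permutation test of the same chi-square statistic. The sketch below is only an illustration on the 20-doctor example from post #1; the helper variable n, the seed, and the number of replications are arbitrary choices made here.

            ```stata
            * Stand-in data: the example from post #1 as cell counts (zero cells omitted).
            * n is a helper variable introduced for this sketch.
            clear
            input speciality smoking_status n
            1 1 2
            1 2 2
            2 1 2
            2 2 2
            3 1 4
            4 2 4
            5 1 3
            5 2 1
            end
            expand n      // one row per doctor
            drop n

            * Shuffle smoking_status across doctors 1,000 times and recompute
            * Pearson's chi-square on each shuffled dataset
            permute smoking_status chi2 = r(chi2), reps(1000) seed(12345) nodots: ///
                tabulate speciality smoking_status, chi2
            ```

            For a chi-square statistic, large values are the evidence against independence, so the upper-tail p-value is the one to read. More replications give a more precise Monte Carlo p-value at the cost of run time.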



            • #7
              Svetlana:
              as an aside to Clyde's (as always) towering insight, have you considered going -logit- (or -logistic-):
              Code:
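              * Start from an empty dataset so that -input- does not clash with variables already in memory
              clear
              input long(speciality smoking_status)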
              1 2
              1 2
              1 1
              1 1
              2 1
              2 1
              2 2
              2 2
              3 1
              3 1
              3 1
              3 1
              4 2
              4 2
              4 2
              4 2
              5 1
              5 2
              5 1
              5 1
              end
              label values speciality speciality
              label def speciality 1 "cardiologist", modify
              label def speciality 2 "endocrinologist", modify
              label def speciality 3 "internal_medicine", modify
              label def speciality 4 "surgery", modify
              label def speciality 5 "traumatology", modify
              
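              * -logit- treats 0 as failure and any other nonmissing value as success,
              * so recode 1 ("no") to 0 and 2 ("yes") to 1
              g smoking_status_new=0 if smoking_status==1
              replace smoking_status_new=1 if smoking_status==2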
              logit smoking_status_new i.speciality
              
              . logit smoking_status_new i.speciality
              
              note: 3.speciality != 0 predicts failure perfectly
                    3.speciality dropped and 4 obs not used
              
              note: 4.speciality != 0 predicts success perfectly
                    4.speciality dropped and 4 obs not used
              
              Iteration 0:   log likelihood = -8.1503192 
              Iteration 1:   log likelihood = -7.7967769 
              Iteration 2:   log likelihood = -7.7945187 
              Iteration 3:   log likelihood =  -7.794518 
              
              Logistic regression                             Number of obs     =         12
                                                              LR chi2(2)        =       0.71
                                                              Prob > chi2       =     0.7006
              Log likelihood =  -7.794518                     Pseudo R2         =     0.0437
              
              ------------------------------------------------------------------------------------
              smoking_status_new |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
              -------------------+----------------------------------------------------------------
                      speciality |
                endocrinologist  |   1.56e-17   1.414214     0.00   1.000    -2.771808    2.771808
              internal_medicine  |          0  (empty)
                        surgery  |          0  (empty)
                   traumatology  |  -1.098612   1.527525    -0.72   0.472    -4.092506    1.895282
                                 |
                           _cons |   6.96e-17          1     0.00   1.000    -1.959964    1.959964
              ------------------------------------------------------------------------------------
              
              .
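
              A possible follow-up to the output above, sketched under the same example data: the recode can be written in one line, -logistic- reports odds ratios instead of coefficients, and -testparm- gives a joint Wald test of the specialty terms that remain in the model. The variables n and smoker are names introduced here for illustration. The perfectly predicted specialties are still dropped, so this test covers only the specialties left in the estimation sample.

              ```stata
              * Example data from post #1 as cell counts (zero cells omitted).
              * n is a helper variable introduced for this sketch.
              clear
              input speciality smoking_status n
              1 1 2
              1 2 2
              2 1 2
              2 2 2
              3 1 4
              4 2 4
              5 1 3
              5 2 1
              end
              expand n      // one row per doctor
              drop n

              * One-line 0/1 recode: 1 = smoker ("yes"), 0 = non-smoker ("no")
              generate byte smoker = (smoking_status == 2) if !missing(smoking_status)

              * Same model as -logit-, reported as odds ratios
              logistic smoker i.speciality

              * Joint Wald test of the specialty indicators kept in the model
              testparm i.speciality
              ```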
              Kind regards,
              Carlo
              (StataNow 18.5)



              • #8
                Originally posted by Clyde Schechter
                Well, that's a lot of zero cells, even with 27 different specialties. I suppose it's not surprising--there aren't a lot of physicians who smoke. I might use the Fisher exact test here, though with a 27x2 table it might run out of memory or just take too long.
                Thank you very much! Your answer really helped me!
