  • chi2/Fisher exact/t tests across 2 samples

    Hello

    This might be very basic, but I would appreciate any advice.

    I plan to run chi2 & Fisher exact for the same categorical variable (e.g. education_level) and also t tests for the same continuous variable (e.g. age) but across 2 different samples. I have variable sample_A = 1 for everybody included in sample A and variable sample_B = 1 for everybody included in sample B. Sample A is bigger and includes every individual in sample B.

    How can I run these tests in Stata?

    Thanks very much.

  • #2
    Originally posted by Daniela Rodrigues
    Sample A is bigger and includes every individual in sample B.

    How can I run these tests in Stata?

    Thanks very much.
    First, create a Sample C, the set of everyone in Sample A who is not also a member of Sample B. Then compare whatever (age, level of education) between Samples B and C.

    • #3
      Thanks for your message Joseph Coveney.

      I do not understand your suggestion however. Why would I need to remove those individuals from Sample A that are also part of Sample B?

      My problem remains: I do not know the command to run chi2 / Fisher exact / t tests for the same variable across samples. I only know the commands to run these tests for different variables in the same sample:

      Code:
      tab var1 var2, chi2 exact
      ttest var1 == var2
      Any advice please?

      Thank you.

      • #4
        Show data, please. You'll waste less time that way.

        • #5
          Apologies, please see below an example of my dataset:

          student_id  age  education_level_parents  school  module  mark  sample_A  sample_B
          1           13   medium                   A       1       81    1         .
          1           13   medium                   A       2       52    1         .
          2           11   high                     A       1       99    1         .
          3           12   low                      A       1       44    1         1
          4           15   low                      A       1       38    1         1
          4           15   low                      A       2       51    1         1
          5           10   high                     A       2       74    1         .
          6           12   medium                   A       1       66    1         .
          7           14   medium                   B       1       65    1         1
          7           14   medium                   B       2       58    1         1
          8           10   high                     B       1       49    1         1
          9           11   low                      B       1       53    1         1
          10          17   low                      B       1       49    1         .
          10          17   low                      B       2       51    1         .
          11          12   medium                   B       2       82    1         1

          I would like to calculate:

          1) t test for age across sample A & B

          2) chi2/ Fisher exact tests for education_level_parents across samples A & B


          Thank you
          Last edited by Daniela Rodrigues; 30 May 2021, 03:48.

          • #6
            Code:
            generate byte grp = !(sample_A & mi(sample_B))
            ranksum age, by(grp)
            tabulate education_level_parents grp, chi2 exact
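
            A minimal sketch of the same grouping, paired with the t test asked for in #1. It types in the example data from #5 with one row per student, which is an assumption made here so that no student is counted twice. With this coding, grp is 1 for students in sample B and 0 for students who are in sample A only.

            ```stata
            * Sketch only: one row per student, values copied from the example in #5.
            clear
            input student_id age sample_A sample_B
            1  13 1 .
            2  11 1 .
            3  12 1 1
            4  15 1 1
            5  10 1 .
            6  12 1 .
            7  14 1 1
            8  10 1 1
            9  11 1 1
            10 17 1 .
            11 12 1 1
            end

            * 1 = member of sample B, 0 = in sample A but not in sample B
            generate byte grp = !(sample_A & mi(sample_B))
            tabulate grp              // check the two group sizes before testing

            * Two-sample t test of age: B versus its complement within A
            ttest age, by(grp)
            ```

            Whether ttest or the rank-based ranksum is the better choice depends on the distribution of the variable; both take the same by(grp) grouping.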

            • #7
              Given the ordered-categorical nature of educational level, 2) would be better with something like
              Code:
              label define ELPs 1 low 2 medium 3 high
              encode education_level_parents, generate(elp) label(ELPs) noextend
              ranksum elp, by(grp)
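
              The label is defined first because encode, left to itself, numbers string values alphabetically, which would put "high" below "low". A small sketch of the difference on three toy rows; elp_alpha is a hypothetical name used only for illustration.

              ```stata
              * Sketch only: three toy rows, one per education level.
              clear
              input str6 education_level_parents
              "medium"
              "high"
              "low"
              end

              * Default behaviour: codes follow alphabetical order (high=1, low=2, medium=3)
              encode education_level_parents, generate(elp_alpha)

              * Predefined label: codes follow the substantive order (low=1, medium=2, high=3).
              * noextend makes encode stop with an error if a value is not in the label.
              label define ELPs 1 low 2 medium 3 high
              encode education_level_parents, generate(elp) label(ELPs) noextend

              list, nolabel             // show the underlying numeric codes side by side
              ```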

              • #8
                Joseph, I tried to reach you but [email protected] bounced. Could you send me an email to christoph dot schnelle at gmail dot com?

                • #9
                  Thanks very much Joseph Coveney.

                  I just do not understand why I need to remove the observations in sample_A that also appear in sample_B. By running your code, I am not comparing sample_A with sample_B, but sample_A-sample_B vs sample_B.

                  In addition, I have individuals repeated in my dataset - e.g. 5th and 6th rows in my example of dataset above correspond to the same individual and so I would like to include only 1 observation for the test on age and education_level_parents, and I would like that individual in both sample_A and sample_B. Your suggestion seems to exclude them from one of the samples, and include both rows in the calculation as if they were 2 different people. Could you please advise?

                  Thank you.
                  Last edited by Daniela Rodrigues; 31 May 2021, 13:34.

                  • #10
                    Originally posted by Daniela Rodrigues
                    I just do not understand why I need to remove the observations in sample_A that also appear in sample_B. By running your code, I am not comparing sample_A with sample_B, but sample_A-sample_B vs sample_B.
                    Correct. I take it that you're not aware of numerous previous threads on this matter, for example, this, this, this, this and this.

                    In addition, I have individuals repeated in my dataset - e.g. 5th and 6th rows in my example of dataset above correspond to the same individual and so I would like to include only 1 observation for the test on age and education_level_parents
                    You'd do something like
                    Code:
                    sort student_id
                    foreach var of varlist elp age {
                        by student_id: assert `var' == `var'[1]
                    }
                    by student_id: keep if _n == 1
                    and then proceed as I've shown above. Or else, just work with the original dataset of participant time-invariant characteristics from which you created this dataset through a -merge 1:m student_id- somewhere along your workflow.
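
                    If dropping rows is undesirable, one possible alternative is to flag a single row per student and restrict the tests to the flagged rows. This is a sketch under the same assumption that age and education level do not vary within student; the variable name first is hypothetical, and the data are the example rows from #5.

                    ```stata
                    * Sketch only: the 15 example rows from #5 (school, module, mark omitted).
                    clear
                    input student_id age str6 education_level_parents sample_A sample_B
                    1  13 medium 1 .
                    1  13 medium 1 .
                    2  11 high   1 .
                    3  12 low    1 1
                    4  15 low    1 1
                    4  15 low    1 1
                    5  10 high   1 .
                    6  12 medium 1 .
                    7  14 medium 1 1
                    7  14 medium 1 1
                    8  10 high   1 1
                    9  11 low    1 1
                    10 17 low    1 .
                    10 17 low    1 .
                    11 12 medium 1 1
                    end

                    generate byte grp = !(sample_A & mi(sample_B))

                    * tag() marks exactly one row per student_id with 1 and the rest with 0
                    egen byte first = tag(student_id)
                    count if first            // number of distinct students

                    * Run the tests on one row per student; the full dataset stays in memory
                    ranksum age if first, by(grp)
                    tabulate education_level_parents grp if first, chi2 exact
                    ```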

                    • #11
                      Thanks for your input Joseph Coveney.

                      I went through the threads you shared on this topic - thanks for those.

                      I have encountered a situation, however, where I have mean age of sample_A = 53 and mean age of sample_B = 53. No difference. Yet, when I do the subset vs complementary comparison:

                      Code:
                      ranksum age, by(grp)
                      ttest age, by(grp)
                      I obtained p-value < 0.001
                      because mean age of subset (sample_B) = 53 but mean age of complementary sample (sample_A - sample_B) = 48.

                      This does not seem reasonable. Could an alternative to creating complementary group be to compare the overall mean of the samples?

                      Thank you.

                      • #12
                        Originally posted by Daniela Rodrigues
                        This does not seem reasonable.
                        Yeah, I agree—it seems that you made an error somewhere.

                        Could an alternative to creating complementary group be to compare the overall mean of the samples?
                        I don't know what you mean by "compare the overall mean of the samples". The overall mean is a single value. What are you comparing it to?

                        • #13
                          I don't know what you mean by "compare the overall mean of the samples". The overall mean is a single value. What are you comparing it to?
                          yes, sorry - please ignore my comment.

                          In terms of my current scenario I have the following for age_parents:
                                                               mean  std dev.
                          Sample_A (main)                      52.6  12
                          Sample_B (subset)                    53.4  12
                          Sample_A - Sample_B (complementary)  47.5  12
                          Following the recommended approach of comparing the subset with the complementary group, I end up comparing Sample_B (53.4) with Sample_A - Sample_B (47.5), and I do find a statistically significant difference here (p-value < 0.001). However, because the means of Sample_A and Sample_B look the same (52.6 and 53.4), I assumed I would not find a difference across samples. That is why I said it seemed an unreasonable result.
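
                          The three means are consistent with one another: the mean of Sample_A is a weighted average of the mean of Sample_B and the mean of the complement, so it sits close to the Sample_B mean whenever Sample_B makes up most of Sample_A. A quick check with the reported figures follows; the share of Sample_A that belongs to Sample_B is not stated in the thread and is only inferred here from the rounded means, so it is approximate.

                          ```stata
                          * Implied share w of Sample_A that is in Sample_B, from
                          *   mean_A = w*mean_B + (1 - w)*mean_complement
                          display (52.6 - 47.5) / (53.4 - 47.5)      // roughly 0.86

                          * Plugging the share back in recovers the Sample_A mean (about 52.6)
                          display 0.864*53.4 + (1 - 0.864)*47.5
                          ```

                          Under that reading, a small complement with a lower mean barely moves the overall mean, yet it can still differ clearly from the subset.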

                          I tried the following:

                          Code:
                          expand 2 if sample_B == 1, generate(dupindicator)
                          ranksum age, by(dupindicator)
                          but also found a p-value < 0.05, although higher than the one before. I understand that by running a ranksum / t test I am assuming independent observations, so duplicating observations might not be appropriate.

                          So, in conclusion - maybe it is not unreasonable to find that, even though the overall means of Sample_A & Sample_B are similar (53), due to the variance within groups, we can still reject the hypothesis that they come from the same sample. would you agree?

                          Thank you.
                          Last edited by Daniela Rodrigues; 02 Jun 2021, 10:02.

                          • #14
                            Originally posted by Daniela Rodrigues
                            I tried the following
                            I don't understand what you are trying to do there, but it doesn't strike me as legit. The comparison that I recommend is 47.5 ± 12 to 53.4 ± 12, that is, B versus its complement. Skip the rest.

                            maybe it is not unreasonable to find that, even though the overall means of Sample_A & Sample_B are similar (53), . . . we can still reject the hypothesis that they come from the same sample. would you agree?
                            A hypothesis pair that posits such an alternative to the null seems pointless to me: you know that B comes from the same sample as A—you constructed it to.

                            • #15
                              Thanks Joseph Coveney for your input on this.
