chi2/Fisher exact/t tests across 2 samples

Daniela Rodrigues

Join Date: Jan 2021

Posts: 30
#1

28 May 2021, 11:51

Hello

This might be very basic, but I would appreciate any advice.

I plan to run chi2 & Fisher exact for the same categorical variable (e.g. education_level) and also t tests for the same continuous variable (e.g. age) but across 2 different samples. I have variable sample_A = 1 for everybody included in sample A and variable sample_B = 1 for everybody included in sample B. Sample A is bigger and includes every individual in sample B.

How can I run these tests in Stata?

Thanks very much.
Joseph Coveney

Join Date: Apr 2014

Posts: 4410
#2

28 May 2021, 16:54

Originally posted by Daniela Rodrigues

Sample A is bigger and includes every individual in sample B.

How can I run these tests in Stata?

Thanks very much.

First, create a Sample C, the set of everyone in Sample A who is not also a member of Sample B. Then compare whatever (age, level of education) between Samples B and C.
Comment
Daniela Rodrigues

Join Date: Jan 2021

Posts: 30
#3

29 May 2021, 17:38

Thanks for your message Joseph Coveney.

I do not understand your suggestion however. Why would I need to remove those individuals from Sample A that are also part of Sample B?

My problem remains: I do not know the command to run chi2 / Fisher exact / t tests for the same variable across samples. I only know the commands to run these tests for different variables in the same sample:

Code:

tab var1 var2, chi2 exact
ttest var1 == var2

Any advice please?

Thank you.
Comment
Joseph Coveney

Join Date: Apr 2014

Posts: 4410
#4

29 May 2021, 20:40

Show data, please. You'll waste less time that way.
Comment

Daniela Rodrigues

Join Date: Jan 2021
Posts: 30
#5

30 May 2021, 03:34

Apologies, please see below an example of my dataset:

student_id	age	education_level_parents	school	module	mark	sample_A	sample_B
1	13	medium	A	1	81	1	.
1	13	medium	A	2	52	1	.
2	11	high	A	1	99	1	.
3	12	low	A	1	44	1	1
4	15	low	A	1	38	1	1
4	15	low	A	2	51	1	1
5	10	high	A	2	74	1	.
6	12	medium	A	1	66	1	.
7	14	medium	B	1	65	1	1
7	14	medium	B	2	58	1	1
8	10	high	B	1	49	1	1
9	11	low	B	1	53	1	1
10	17	low	B	1	49	1	.
10	17	low	B	2	51	1	.
11	12	medium	B	2	82	1	1

I would like to calculate:

1) t test for age across sample A & B

2) chi2/ Fisher exact tests for education_level_parents across samples A & B

Thank you

Last edited by Daniela Rodrigues; 30 May 2021, 03:48.

Comment

Joseph Coveney

Join Date: Apr 2014
Posts: 4410
#6

30 May 2021, 23:00

Code:

generate byte grp = !(sample_A & mi(sample_B))
ranksum age, by(grp)
tabulate education_level_parents grp, chi2 exact
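
For anyone who wants to try this on the example data, here is a minimal, self-contained sketch. It assumes one row per student, keyed in by hand from the example listing posted earlier in the thread, and it adds the t test on age that was originally asked about. Only the -generate- line and the variable names come from the thread; the rest is illustrative.

```stata
* Sketch only: one row per student, keyed in from the example listing
clear
input student_id age sample_A sample_B str6 education_level_parents
 1 13 1 . medium
 2 11 1 . high
 3 12 1 1 low
 4 15 1 1 low
 5 10 1 . high
 6 12 1 . medium
 7 14 1 1 medium
 8 10 1 1 high
 9 11 1 1 low
10 17 1 . low
11 12 1 1 medium
end

* grp = 1 for members of sample_B, 0 for the rest of sample_A (the complement)
generate byte grp = !(sample_A & mi(sample_B))
tabulate grp

* t test on age between B and its complement; add -unequal- for Welch's test
ttest age, by(grp)
```

With only eleven students the tests have very little power, so this is meant to show the mechanics rather than any result.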

Comment

Joseph Coveney

Join Date: Apr 2014

Posts: 4410
#7

30 May 2021, 23:03

Given the ordered-categorical nature of educational level, 2) would be better with something like

Code:

label define ELPs 1 low 2 medium 3 high
encode education_level_parents, generate(elp) label(ELPs) noextend
ranksum elp, by(grp)
Comment
Christoph Schnelle

Join Date: Oct 2014

Posts: 10
#8

30 May 2021, 23:05

Joseph, I tried to reach you but [email protected] bounced. Could you send me an email to christoph dot schnelle at gmail dot com?
Comment
Daniela Rodrigues

Join Date: Jan 2021

Posts: 30
#9

31 May 2021, 13:23

Thanks very much Joseph Coveney.

I just do not understand why I need to remove the observations in sample_A that also appear in sample_B. By running your code, I am not comparing sample_A with sample_B, but sample_A-sample_B vs sample_B.

In addition, I have individuals repeated in my dataset - e.g. 5th and 6th rows in my example of dataset above correspond to the same individual and so I would like to include only 1 observation for the test on age and education_level_parents, and I would like that individual in both sample_A and sample_B. Your suggestion seems to exclude them from one of the samples, and include both rows in the calculation as if they were 2 different people. Could you please advise?

Thank you.

Last edited by Daniela Rodrigues; 31 May 2021, 13:34.
Comment
Joseph Coveney

Join Date: Apr 2014

Posts: 4410
#10

31 May 2021, 17:41

Originally posted by Daniela Rodrigues

I just do not understand why I need to remove the observations in sample_A that also appear in sample_B. By running your code, I am not comparing sample_A with sample_B, but sample_A-sample_B vs sample_B.

Correct. I take it that you're not aware of numerous previous threads on this matter, for example, this, this, this, this and this.

In addition, I have individuals repeated in my dataset - e.g. 5th and 6th rows in my example of dataset above correspond to the same individual and so I would like to include only 1 observation for the test on age and education_level_parents

You'd do something like

Code:

sort student_id
foreach var of varlist elp age {
    by student_id: assert `var' == `var'[1]
}
by student_id: keep if _n == 1

and then proceed as I've shown above. Or else, just work with the original dataset of participant time-invariant characteristics from which you created this dataset through a -merge 1:m student_id- somewhere along your workflow.
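
If dropping rows is not convenient, one possible alternative is to flag a single observation per student and restrict the tests to the flagged rows. The sketch below is self-contained and uses a few long-format rows from the example listing; first_obs is a hypothetical variable name, and this is a variation on the approach above, not something proposed in the thread.

```stata
* Sketch only: a few long-format rows from the example, with repeated students
clear
input student_id age sample_A sample_B
1 13 1 .
1 13 1 .
2 11 1 .
3 12 1 1
4 15 1 1
4 15 1 1
7 14 1 1
7 14 1 1
end

* grp = 1 for members of sample_B, 0 for the rest of sample_A
generate byte grp = !(sample_A & mi(sample_B))

* first_obs (hypothetical name) is 1 on exactly one row per student, 0 elsewhere
egen byte first_obs = tag(student_id)

* Restrict the test to one row per student instead of dropping observations
ranksum age if first_obs, by(grp)
```

This keeps the module-level rows available for other analyses, but it still relies on age being constant within student, which is what the -assert- above checks.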
Comment
Daniela Rodrigues

Join Date: Jan 2021

Posts: 30
#11

02 Jun 2021, 09:05

Thanks for your input Joseph Coveney.

I went through the threads you shared on this topic - thanks for those.

I have encountered a situation, however, where I have mean age of sample_A = 53 and mean age of sample_B = 53. No difference. Yet, when I do the subset vs complementary comparison:

ranksum age, by(grp)
ttest age, by(grp)

I obtained

Code:

p-value < 0.001

because mean age of subset (sample_B) = 53 but mean age of complementary sample (sample_A - sample_B) = 48.

This does not seem reasonable. Could an alternative to creating complementary group be to compare the overall mean of the samples?

Thank you.
Comment
Joseph Coveney

Join Date: Apr 2014

Posts: 4410
#12

02 Jun 2021, 09:19

Originally posted by Daniela Rodrigues

This does not seem reasonable.

Yeah, I agree—it seems that you made an error somewhere.

Could an alternative to creating complementary group be to compare the overall mean of the samples?

I don't know what you mean by "compare the overall mean of the samples". The overall mean is a single value. What are you comparing it to?
Comment
Daniela Rodrigues

Join Date: Jan 2021

Posts: 30
#13

02 Jun 2021, 09:56

I don't know what you mean by "compare the overall mean of the samples". The overall mean is a single value. What are you comparing it to?

yes, sorry - please ignore my comment.

In terms of my current scenario, I have the following for age_parents:

	mean	std dev.
Sample_A (main)	52.6	12
Sample_B (subset)	53.4	12
Sample_A - Sample_B (complementary)	47.5	12

Following the recommended approach of comparing the subset with the complementary group, I end up comparing Sample_A (52.6) with Sample_A - Sample_B (47.5), and so I do find a statistically significant difference here (p-value < 0.001). However, because the means of Sample_A and Sample_B look the same (52.6 and 53.4), I assumed I would not find a difference across samples. That is why I said it seemed an unreasonable result.

I tried the following:

Code:

expand 2 if Sample_B==1, gen(dupindicator)
ranksum age, by(dupindicator)

but I also found a p-value < 0.05, although higher than the one before. I understand that by running a ranksum / t test I am assuming independent observations, and so creating duplicate observations might not be appropriate.

So, in conclusion - maybe it is not unreasonable to find that, even though the overall means of Sample_A & Sample_B are similar (53), due to the variance within groups, we can still reject the hypothesis that they come from the same sample. would you agree?

Thank you.

Last edited by Daniela Rodrigues; 02 Jun 2021, 10:02.
Comment
Joseph Coveney

Join Date: Apr 2014

Posts: 4410
#14

02 Jun 2021, 18:29

Originally posted by Daniela Rodrigues

I tried the following

I don't understand what you are trying to do there, but it doesn't strike me as legit. The comparison that I recommend is 47.5 ± 12 to 53.4 ± 12, that is, B versus its complement. Skip the rest.

maybe it is not unreasonable to find that, even though the overall means of Sample_A & Sample_B are similar (53), . . . we can still reject the hypothesis that they come from the same sample. would you agree?

A hypothesis pair that posits such an alternative to the null seems pointless to me: you know that B comes from the same sample as A—you constructed it to.
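
The three means quoted above are consistent with one another once the group sizes are taken into account. As a worked check, assume Sample_A is exactly the union of Sample_B and its complement (call it C), and write w = n_B / n_A; both C and w are notation introduced here, not in the thread. The overall mean is then a weighted average of the two group means:

```latex
\[
\bar{x}_A = w\,\bar{x}_B + (1 - w)\,\bar{x}_C
\quad\Longrightarrow\quad
w = \frac{\bar{x}_A - \bar{x}_C}{\bar{x}_B - \bar{x}_C}
  = \frac{52.6 - 47.5}{53.4 - 47.5}
  = \frac{5.1}{5.9} \approx 0.86
\]
```

So, up to rounding in the reported means, B would make up roughly 86% of A. That would explain why the means of A and B are nearly identical even though B and its complement differ by about 6 years.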
Comment
Daniela Rodrigues

Join Date: Jan 2021

Posts: 30
#15

03 Jun 2021, 03:29

Thanks Joseph Coveney for your input on this.
Comment
