  • Conducting t-tests between an original sample and a sub-sample

    I'm trying to figure out the best way to conduct t-tests between some descriptive statistics of my original sample and my final sample after data cleaning. Essentially, I want to see how representative my final sample is of my original sample. My data set is large, so unfortunately I can't manually track the variables that are getting dropped and adjust the spreadsheets. I see that there are some old posts on this topic from a number of years ago, but I did not find them helpful and I just wanted to see if there might be some better or newer approaches to this. Thank you!

  • #2
    Jessica:
    can't you prepare two datasets with the original sample and the subsample, -append- them, and then run -ttest-?
    That said, if the subsample is a sort of made-up version of the original one, due to the arbitrary omission of missing values, it is probably biased, especially if the missingness is informative.
    In any case, whenever we talk about tests we leave the descriptive world (where the data speak for themselves) and enter the assumption-paved inferential realm.
    Kind regards,
    Carlo
    (Stata 19.0)
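
    A minimal sketch of that approach, assuming the two samples are saved as original.dta and subsample.dta and share a numeric variable x (all file and variable names hypothetical):

    Code:
        * load the cleaned subsample and tag its observations
        use subsample, clear
        generate byte group = 1

        * append the original sample and tag those observations
        append using original
        replace group = 0 if missing(group)

        * compare means of x across the two tagged groups
        ttest x, by(group)

    Note that the subsample's observations then appear in both groups, so the two groups are not disjoint (the point raised in #3).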

    • #3
      t tests are defined for disjoint sets or paired values. You don't seem to have either, so the problem seems one of descriptive statistics, including graphics.

      • #4
        Hi Carlo, I was thinking that I needed to subtract the subsample from the original sample, which I couldn't figure out how to do. But as I think about this more, you're correct: it would be a comparison between the subsample and the sample. However, as Nick pointed out, t-tests may not be appropriate for this type of comparison. In that case, what other types of significance tests might be appropriate for assessing the representativeness of a sample? Thank you!

        • #5
          No other significance tests spring to mind for precisely the same reason. If this were my problem, I would make the comparison graphical.

          • #6
            Jessica:
            re-reading your original post and Nick's wise advice, I realize I should have been more careful in my previous reply.
            On second thought, I would follow Nick's advice in #5.
            That said, the issue of a subsample made up by omitting observations with missing values still holds.
            Kind regards,
            Carlo
            (Stata 19.0)

            • #7
              Hello! I've got the same problem, and I was wondering whether this solution would be correct: divide the sample into the subsample I am interested in and a second subsample with the rest of the observations, then t-test whether the descriptive statistics of the chosen variables are statistically the same across the two subsamples. If there turns out to be no significant difference, that would mean the subsample is truly representative of the sample. This only works, of course, if the t-tests show no significant difference; if there is a significant difference, we cannot know whether the same conclusion would stand if the subsample were compared to the entire sample. Would you say that this could be a correct solution?
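
              A minimal sketch of that split within a single dataset, assuming observations were dropped during cleaning for missing values of x or y (variable names hypothetical):

              Code:
                  * flag the observations that survive cleaning (complete cases on x and y)
                  generate byte kept = !missing(x, y)

                  * compare means of x between the kept subsample and the dropped remainder
                  ttest x, by(kept) unequal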

              • #8
                Yes and no. The t test compares means, not whether distributions are the same. The best way to compare distributions is to plot them, say as quantiles.

                • #9
                  How about the Kolmogorov–Smirnov test of equal distributions?
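
                  For reference, the two-sample version in Stata, reusing the hypothetical kept indicator from the sketch after #7:

                  Code:
                      * two-sample Kolmogorov-Smirnov test of equality of
                      * the distributions of x across the two groups
                      ksmirnov x, by(kept)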

                  • #10
                    It is what it is. I've never found it more helpful than other approaches. I'd note the detail that K-S necessarily works better at detecting differences between the middles of distributions than between the tails, which is often the opposite of what is needed.

                    Some context would help here. In various fields the culture runs that every finding must be supported by a formal significance test as a sign of objectivity and rigour. Whatever the K-S result, there is always a question of what the distributions are and of whether you're getting a particular result for the right reasons.

                    • #11
                      Some of the problems here arise from what we are used to. I am of a generation whose early statistical education placed more emphasis on histograms than on any other graphical method. Histograms, while often very helpful, have known (and often severe) problems of dependence on bin width and origin, and while two or more histograms contain a lot of information in principle, they aren't always easy to compare in practice.

                      Box plots often leave out far too much, but, preferably enhanced with more detail than is common, they can be almost the only practical way to compare dozens of distributions.

                      The Kolmogorov-Smirnov test is tied to comparison of cumulative distributions, but while plots of such functions are enjoying a surge of popularity in some fields, I don't find them easy to think about, especially if what you care about is in the tails, where cumulative probabilities are near 0 or near 1.

                      For comparing two groups, I would tend to start with quantile plots; see e.g. #2 at https://www.statalist.org/forums/for...dable-from-ssc

                      Naturally much depends on your data, your purpose and the intended readership. Oddly, or not, graphical practice in the better newspapers and weeklies is often far ahead of standards in some academic fields....
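
                      A minimal sketch of such a comparison with official Stata commands, again using the hypothetical kept indicator:

                      Code:
                          * split x into x0 (dropped) and x1 (kept) by group
                          separate x, by(kept)

                          * plot quantiles of one group against quantiles of the other;
                          * if the distributions match, points fall near the line y = x
                          qqplot x1 x0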

                      • #12
                        I used -quantile- and it looked pretty indicative of similarities in the distributions. -transplot- showed the distributions were basically the same. Thank you!
