Two-sample Kolmogorov-Smirnov test

Frieder Neunhoeffer

Join Date: Mar 2022

Posts: 11
#1

03 Dec 2022, 05:43

Dear Statalists,

I would like to test whether two samples can be assumed to come from the same population, based on their distributions. I thought of the KS test and read the corresponding Stata manual entry.
Now I'm confused for three reasons.

First, the D-value in the second row (-0.1667) implies that the largest difference between the cumulative distribution functions of group 2 and group 1 is minus 0.1667, whereas my own calculation, which I also checked against the graph, gives 1/12.
Second, the insignificant p-value in the second row must be due to the small sample size. Hence, increasing the sample size by duplicating the two samples a few times should make that p-value significant at some point. What happens instead is that the p-value in the first row soon becomes significant. According to the manual, the hypothesis tested in the first row is that "group 1 contains smaller values than for group 2", which is clearly the case but is rejected by the test.
Third, I tried to reproduce example 1 in https://www.real-statistics.com/non-...-smirnov-test/ but neither the D-value nor the corresponding p-value matches.
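
To make the duplication experiment concrete, here is a minimal sketch. The data are hypothetical toy values (the samples behind the numbers above are not shown), and the variable names group and x are assumptions chosen for illustration.

```stata
* Hypothetical toy data, not the samples discussed in this post
clear
input group x
1 1
1 2
1 3
1 4
1 5
1 6
2 3
2 4
2 5
2 6
2 7
2 8
end

* Rows 1 and 2 of the output are the two one-sided comparisons;
* the last row is the combined two-sided test
ksmirnov x, by(group)
return list

* Duplicating every observation leaves both empirical distribution
* functions, and therefore D, unchanged; only the sample sizes that
* enter the p-value calculation grow
expand 10
ksmirnov x, by(group)
```

Comparing the two outputs should show identical D-values with smaller p-values after expand. Duplicated observations carry no new independent information, so those smaller p-values only illustrate how the calculation depends on sample size.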

It would be great if you can clarify. Thanks a lot,
Frieder
Nick Cox

Join Date: Mar 2014

Posts: 35431
#2

04 Dec 2022, 04:27

You don't show any code, so it is hard to comment on what you did.

Rupert Miller commented in 1986 (reprinted 1997) that the Kolmogorov-Smirnov test is more sensitive at comparing middles of distributions than the tails. The reference is given and the comment is repeated in the Stata manuals at [R] diagnostic plots. His precise comment was more laconic, but I think it follows immediately from the definition of the test statistic.

As the opposite is what I usually need, this test joins a long list of tests that are in the books, but don't seem much use in my research practice. The issue is bundled together with the usual doubt about whether a significance test is what you really need most.

I would draw a quantile plot (e.g. using qqplot (official) or qplot (Stata Journal)) or just use some focused test.
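
A minimal sketch of both plots, using the auto data as a stand-in for your samples. It assumes qplot has already been installed from the Stata Journal; the graph names QQ and Q are arbitrary.

```stata
sysuse auto, clear

* qqplot (official) compares two variables, so split mpg by foreign first;
* separate creates mpg0 (Domestic) and mpg1 (Foreign)
separate mpg, by(foreign)
qqplot mpg0 mpg1, name(QQ, replace)

* qplot is user-written (Stata Journal); assumed to be installed already
qplot mpg, over(foreign) name(Q, replace)
```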

https://www.stata-journal.com/articl...article=gr0027 has some more focused ideas. One main theme there is a standard in statistical graphics.

If you are interested in differences, then calculate differences and plot them directly.

Here

1. Differences should be calculated as appropriate: for example, you might be best advised to calculate differences on a transformed scale, say logarithmic or reciprocal or logit.

2. The usual pairing is difference (vertical axis) versus mean (or equivalently sum) (horizontal axis) as then the horizontal line difference = 0 marks the reference case of equal distributions.

This can be done even if the comparison is of subsets that are of unequal size. You just calculate what I call corresponding quantiles, whereby you plot the smaller subset of quantiles against interpolated quantiles from the larger subset. cquantile is a helper command on SSC since 2005.

Here is the mundane case of mpg from the auto data. Not only is there a systematic tendency for foreign mpg to be higher than domestic mpg, the difference is not just an additive shift. In fact, these data have been pored over many times, and the next step is to consider a transformation: reciprocal rather than logarithm is appealing on dimensional grounds, as gallons per mile are natural units. Indeed, for many people, litres per km would also seem a straightforward alternative. (In practice, gallons per 100 miles or litres per 100 km is slightly more appealing, a quite different point.)

Code:

sysuse auto, clear
set scheme s1color
* ssc install cquantile
cquantile mpg, by(foreign) gen(mpg0 mpg1)
scatter mpg1 mpg0, ms(Oh) ytitle(Foreign) xtitle(Domestic) ///
|| function equality=x, range(mpg0) sort legend(off) name(G1, replace)
gen diff = mpg1 - mpg0
label var diff "Foreign quantile - Domestic quantile"
gen mean = (mpg1 + mpg0) / 2
label var mean "(Foreign quantile + Domestic quantile) / 2"
scatter diff mean, ysc(r(0 .)) yla(0(2)10) ms(Oh) name(G2, replace)
graph combine G1 G2, t1title(Foreign and domestic mpg)
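
The next step mentioned above, the reciprocal scale, can be sketched the same way. This is only an illustration: the variable names gp100m, gp0, gp1, gpdiff and gpmean and the graph name G3 are made up here, and gallons per 100 miles is one of several reasonable reciprocal scales.

```stata
sysuse auto, clear
* ssc install cquantile

* Gallons per 100 miles: the reciprocal of mpg, rescaled
gen gp100m = 100 / mpg
label var gp100m "Gallons per 100 miles"

* Corresponding quantiles on the reciprocal scale
cquantile gp100m, by(foreign) gen(gp0 gp1)
gen gpdiff = gp1 - gp0
label var gpdiff "Foreign quantile - Domestic quantile"
gen gpmean = (gp1 + gp0) / 2
label var gpmean "(Foreign quantile + Domestic quantile) / 2"

* yline(0) marks the reference case of equal distributions
scatter gpdiff gpmean, yline(0) ms(Oh) name(G3, replace)
```

If the points scatter around a roughly horizontal line, the difference between the groups is close to an additive shift on this scale.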
Frieder Neunhoeffer

Join Date: Mar 2022

Posts: 11
#3

08 Dec 2022, 04:50

Great, I've thoroughly worked through your mentioned article and the examples and I start to understand the appeal of quantile plots. Thanks for the "enlightenment", Nick!!
Nick Cox

Join Date: Mar 2014

Posts: 35431
#4

08 Dec 2022, 05:54

Thanks much for closure (for the moment!).
Nick Cox

Join Date: Mar 2014

Posts: 35431
#5

12 Dec 2022, 12:04

See now https://www.statalist.org/forums/for...lable-from-ssc for a way to automate the graphs in #2.