
  • Two-sample Kolmogorov-Smirnov test

    Dear Statalists,

    I would like to test whether two samples can be assumed to come from the same population, based on their distributions. I thought of the KS test and read the corresponding Stata manual entry.
    Now I'm confused for three reasons.

    First, the D-value in the second row (-0.1667) implies that the largest difference between the cumulative distribution functions of group 2 and group 1 is -0.1667, whereas doing the math myself and checking it in the graph gives 1/12.
    Second, the insignificant p-value in the second row must be due to the small sample size. Hence, increasing the sample size by replicating the two samples a few times should make that p-value significant at some point. However, what happens is that the p-value in the first row soon becomes significant. Following the manual, the null hypothesis tested in the first row is whether "group 1 contains smaller values than for group 2", which is clearly the case but is rejected by the test.
    Third, I tried to reproduce example 1 in https://www.real-statistics.com/non-...-smirnov-test/ but neither the D-value nor the corresponding p-value match.

    It would be great if you can clarify. Thanks a lot,
    Frieder

  • #2
    You don't show any code, so it is hard to comment on what you did.
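
    For anyone wanting something concrete to run, here is a minimal sketch of the kind of call presumably in question. Using mpg and foreign from the auto data is an assumption for illustration only, as the original data are not shown. The second call replicates every observation with expand: that leaves the empirical distribution functions, and hence D, unchanged, while the P-values get smaller because they depend on the sample sizes.

    ```stata
    sysuse auto, clear

    * illustrative data only: not the data from the original question
    * the output has one row per direction of difference and a combined row
    ksmirnov mpg, by(foreign)
    return list
    scalar D_before = r(D)
    scalar p_before = r(p)

    * five copies of each observation: same distribution functions, larger samples
    expand 5
    ksmirnov mpg, by(foreign)
    scalar D_after = r(D)
    scalar p_after = r(p)

    display "combined D: " D_before " -> " D_after
    display "combined P: " p_before " -> " p_after
    ```

    Replicating observations like this is only a way to see how the P-value reacts to sample size; it is not a legitimate way to get more data.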

    Rupert Miller commented in 1986 (reprinted 1997) that the Kolmogorov-Smirnov test is more sensitive at comparing middles of distributions than the tails. The reference is given and the comment is repeated in the Stata manuals at [R] diagnostic plots. His precise comment was more laconic, but I think it follows immediately from the definition of the test statistic.
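
    To spell out that last step, here is the usual definition of the two-sample statistic together with an elementary bound. The notation is mine, not Miller's: \hat F_1 and \hat F_2 stand for the two empirical distribution functions.

    ```latex
    D = \sup_x \left| \hat F_1(x) - \hat F_2(x) \right|,
    \qquad
    \left| \hat F_1(x) - \hat F_2(x) \right|
      \le \min\Bigl( \max\{\hat F_1(x), \hat F_2(x)\},\;
                     \max\{1 - \hat F_1(x), 1 - \hat F_2(x)\} \Bigr).
    ```

    Both functions are near 0 in the lower tail and near 1 in the upper tail, so the bound forces the difference to be small there. The supremum is therefore usually attained somewhere in the middle, whatever is happening in the tails.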

    As the opposite is what I usually need, this test joins a long list of tests that are in the books, but don't seem much use in my research practice. The issue is bundled together with the usual doubt about whether a significance test is what you really need most.

    I would draw a quantile plot (e.g. using qqplot (official) or qplot (Stata Journal)) or just use some focused test.
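
    As a minimal sketch of those two suggestions, again assuming the auto data purely for illustration: qqplot expects two variables, so the response is split by group first. The over() option of qplot is used as I understand its help; check the help file of your installed version.

    ```stata
    sysuse auto, clear

    * qqplot takes two variables: separate creates mpg0 (Domestic) and mpg1 (Foreign)
    separate mpg, by(foreign)
    qqplot mpg1 mpg0, ytitle(Foreign) xtitle(Domestic) name(Q1, replace)

    * qplot is from the Stata Journal: -search qplot- should find the package
    qplot mpg, over(foreign) name(Q2, replace)
    ```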

    https://www.stata-journal.com/articl...article=gr0027 has some more focused ideas. One main theme there is a standard in statistical graphics.

    If you are interested in differences, then calculate differences and plot them directly.

    Here

    1. Differences should be calculated as appropriate: for example, you might be best advised to calculate differences on a transformed scale, say logarithmic or reciprocal or logit.

    2. The usual pairing is difference (vertical axis) versus mean (or equivalently sum) (horizontal axis) as then the horizontal line difference = 0 marks the reference case of equal distributions.

    This can be done even if the comparison is of subsets that are of unequal size. You just calculate what I call corresponding quantiles, whereby you plot the smaller subset of quantiles against interpolated quantiles from the larger subset. cquantile is a helper command on SSC since 2005.

    Here is the mundane case of mpg from the auto data. Not only is there a systematic tendency for foreign mpg to be higher than domestic mpg, the difference is not just an additive shift. In fact, these data have been pored over many times, and the next step is to consider a transformation: reciprocal rather than logarithm is appealing on dimensional grounds, as gallons per mile are natural units. Indeed, for many people, litres per km would also seem a straightforward alternative. (In practice, gallons per 100 miles or litres per 100 km is slightly more appealing, a quite different point.)

    Code:
    sysuse auto, clear
    set scheme s1color 
    
    * ssc install cquantile
    cquantile mpg, by(foreign) gen(mpg0 mpg1)
    scatter mpg1 mpg0, ms(Oh) ytitle(Foreign) xtitle(Domestic) ///
    || function equality=x, range(mpg0) sort legend(off) name(G1, replace)
    
    gen diff = mpg1 - mpg0
    label var diff "Foreign quantile - Domestic quantile"
    gen mean = (mpg1 + mpg0) / 2
    label var mean "(Foreign quantile + Domestic quantile) / 2"
    scatter diff mean, ysc(r(0 .)) yla(0(2)10) ms(Oh) name(G2, replace)
    
    graph combine G1 G2, t1title(Foreign and domestic mpg)
    [Image: cquantile.png, the combined graph produced by the code above]
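
    The transformation step mentioned before the code can be sketched in the same way. This is a sketch only: gp100m, gpm0, gpm1, gdiff and gmean are names invented here, and gallons per 100 miles is just one of the reciprocal scales mentioned.

    ```stata
    sysuse auto, clear

    * ssc install cquantile
    * reciprocal scale: gallons per 100 miles
    gen gp100m = 100 / mpg
    label var gp100m "Gallons per 100 miles"
    cquantile gp100m, by(foreign) gen(gpm0 gpm1)

    * difference versus mean of corresponding quantiles, as for mpg
    gen gdiff = gpm1 - gpm0
    label var gdiff "Foreign quantile - Domestic quantile"
    gen gmean = (gpm1 + gpm0) / 2
    label var gmean "(Foreign quantile + Domestic quantile) / 2"
    scatter gdiff gmean, yline(0) ms(Oh) name(G3, replace)
    ```

    If the points scatter around a horizontal line, the difference on this scale is close to an additive shift.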



    • #3
      Great, I've worked thoroughly through the article you mentioned and its examples, and I am starting to understand the appeal of quantile plots. Thanks for the "enlightenment", Nick!!



      • #4
        Thanks much for closure (for the moment!).



        • #5
          See now https://www.statalist.org/forums/for...lable-from-ssc for a way to automate the graphs in #2.
