
  • How do I define a custom distribution to use in a one-sample ksmirnov test?

    Hello,

    I want to use the one-sample Kolmogorov-Smirnov test to check whether the distribution of my sample data corresponds to a theoretical distribution. My theoretical distribution is posted below. In case it is useful, here is why I need this particular theoretical distribution: it comes from a game in which a participant throws 10 unfair coins, each with a 10% chance of landing on heads, and k counts the total number of heads in a series of 10 throws. I calculated the probabilities as Bernoulli trials (i.e., from the binomial distribution).

    I want to see whether my collected data matches the theoretical distribution presented below. One-sample ksmirnov is probably the appropriate test for this. It is clear to me how to perform this test to compare the distribution of my sample with, for example, a Student's t distribution, e.g. -ksmirnov k = t(5,v1)-. However, I don't know how to compare the distribution of my sample to the custom distribution presented below. Any help would be appreciated.

    k Probability of occurrence
    0 0.3486784401
    1 0.3874204890
    2 0.1937102445
    3 0.0573956280
    4 0.0111602610
    5 0.0014880348
    6 0.0001377810
    7 0.0000087480
    8 0.0000003645
    9 0.0000000090
    10 0.0000000001
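    As a side note (not from the thread itself), the tabulated probabilities are exactly the binomial pmf with n = 10 and p = 0.1, which can be verified directly; a quick cross-check in Python:

    ```python
    from math import comb

    # Binomial pmf for k heads in 10 flips with P(heads) = 0.1,
    # which is what the table above tabulates.
    p = [comb(10, k) * 0.1**k * 0.9**(10 - k) for k in range(11)]

    # p[0] = 0.3486784401, p[1] = 0.3874204890, p[2] = 0.1937102445, ...
    # matching the table to all 10 printed decimal places.
    ```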

  • #2
    -help ksmirnov- indicates that you can use an expression to specify the expected theoretical distribution for a one-sample test, which might be possible, since each coin flip is Bernoulli with some known (?) varying parameter.

    Failing that: -search ksmirnov- reveals some user-written software, of which the most relevant appears to be -mgof- (see -ssc describe mgof-). I have not used this program before, but it *appears* to me that the code below does what you want, using a feature of -mgof- that allows you to specify a matrix containing the observed and expected frequencies. I am mildly uncertain about some features of its syntax, so I would strongly encourage you to find a known worked example of the one-sample KS test and compare results against something like the code below before trusting that what I'm suggesting is right.

    A good feature of -mgof-, by the way, is that it readily uses exact or Monte Carlo methods to obtain a p-value. Given your tiny sample size of 10 observations (right? 10 observations of the count of heads?), the asymptotic KS or chi-squared test will not perform well. Here's what I think will work:
    Code:
    clear
    // Theoretical point probabilities, for 0/10 heads, put in a column vector for use by -mgof-.
    mat P = (          ///
    0.3486784401  \    ///
    0.3874204890  \    ///
    0.1937102445  \    ///
    0.0573956280  \    ///
    0.0111602610  \    ///
    0.0014880348  \    ///
    0.0001377810  \    ///
    0.0000087480  \    ///
    0.0000003645  \    ///
    0.0000000090  \    ///
    0.0000000001  )
    //
    // Simulate something like your observed data, which you did not happen to provide.
    set seed 4755
    local throws = 10
    set obs `throws'
    local pheads = 0.1
    gen byte heads =  rbinomial(10, `pheads')
    // Observed counts of heads, obtained
    // so as to include counts of 0
    mat O = J(11,1, 0)
    forval i = 0/10 {
      count if heads == `i'
      mat O[`i'+1, 1] = r(N)
    }
    mat list O
    //
    // Observed/expected matrix for -mgof-
    mat E = `throws' * P //expected
    mat OE = O, E
    mat list OE
    //
    mgof, matrix(OE) ksmirnov mc
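    Not from the thread: the simulation-and-counting step above can be sketched in Python just to make the logic explicit (the seed is arbitrary and will not reproduce Stata's rbinomial() stream).

    ```python
    import random

    random.seed(4755)  # arbitrary; does NOT match Stata's random stream

    # Each observation is one game of 10 unfair flips with P(heads) = 0.1,
    # mirroring -gen byte heads = rbinomial(10, 0.1)- above.
    heads = [sum(random.random() < 0.1 for _ in range(10)) for _ in range(10)]

    # Observed counts of 0..10 heads, keeping zero cells
    # (the role of the J(11,1,0) initialization in the Stata code).
    O = [heads.count(k) for k in range(11)]
    ```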



    • #3
      Thank you so much. I tried to use ksmirnov with an expression to specify the expected theoretical distribution for a one-sample test, but I could not figure out the syntax to achieve this. An example of how to specify the expected theoretical distribution would solve my problem, I think.

      I tried using the code but I don't think I could get it to work.

      Code:
      clear
      // Theoretical point probabilities, for 0/10 heads, put in a column vector for use by -mgof-.
      mat P = (          ///
      0.3486784401  \    ///
      0.3874204890  \    ///
      0.1937102445  \    ///
      0.0573956280  \    ///
      0.0111602610  \    ///
      0.0014880348  \    ///
      0.0001377810  \    ///
      0.0000087480  \    ///
      0.0000003645  \    ///
      0.0000000090  \    ///
      0.0000000001  )
      
      local throws = 81
      set obs `throws'
      
      // Example of frequencies observed in the data set
      mat O = (  ///
      27 \    ///
      33 \    ///
      10 \    ///
       5 \    ///
       1 \    ///
       0 \    ///
       0 \    ///
       0 \    ///
       0 \    ///
       0 \    ///
       5 )
      //
      mat list O
      //
      // Observed/expected matrix for -mgof-
      mat E = `throws' * P //expected
      mat OE = O, E
      mat list OE
      //
      mgof, matrix(OE) ksmirnov mc
      Gives this output:
      Code:
                                                     Number of obs =      81
                                                     N of outcomes =      11
                                                     Replications  =   10000
      
      ----------------------------------------------------------------------
                            |                  Exact                        
            Goodness-of-fit |       Coef.    P-value    [99% Conf. Interval]
      ----------------------+-----------------------------------------------
               Pearson's X2 |    3.09e+09     0.0000      0.0000      0.0005
       Log likelihood ratio |    195.2182     0.0000      0.0000      0.0005
       Kolmogorov-Smirnov D |    .0656116     0.3632      0.3508      0.3757
      ----------------------------------------------------------------------
      I doubt this is correct, because it gives a nonsignificant p-value for the KS test even though, in this hypothetical example, an event with a 10^-10 probability of happening by chance occurs 5 times. Thank you very much for the help so far.
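      For what it's worth, the reported D can be reproduced by hand (a Python cross-check, not part of the thread): the one-sample KS statistic here is the largest absolute gap between the empirical and theoretical CDFs. That maximum occurs at k = 2, not in the tail, because the gap opened by the 5 wildly improbable k = 10 observations closes at the last support point, where both CDFs reach 1.

      ```python
      # Theoretical binomial(10, 0.1) probabilities as tabulated earlier,
      # and the observed frequencies from this post.
      P = [0.3486784401, 0.3874204890, 0.1937102445, 0.0573956280,
           0.0111602610, 0.0014880348, 0.0001377810, 0.0000087480,
           0.0000003645, 0.0000000090, 0.0000000001]
      O = [27, 33, 10, 5, 1, 0, 0, 0, 0, 0, 5]
      n = sum(O)  # 81

      # Cumulative distributions and the KS statistic D = max CDF gap
      cumP, cumO, acc_p, acc_o = [], [], 0.0, 0
      for p, o in zip(P, O):
          acc_p += p
          acc_o += o
          cumP.append(acc_p)
          cumO.append(acc_o / n)

      D = max(abs(fo - fp) for fo, fp in zip(cumO, cumP))
      # D ≈ 0.0656116, matching the mgof output above; the max is at k = 2.
      ```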



      • #4
        Kolmogorov-Smirnov is beautiful mathematics but I think oversold. Necessarily it is more sensitive in the middle of a distribution than in the tails, precisely the opposite of what is most needed in practice. I'd fall back on chi-square -- even though it doesn't use all the information in the data -- mostly because just as valuable as some flavour of P-value are residuals that you can use to see where the fit is best or worst.

        chitesti from tab_chi on SSC goes back to 2004 but does not appear to have been superseded.



        • #5
          That said, here is Mata used as a calculator. The chi-square statistic is massive because of the discrepancy in the last bin; Kolmogorov-Smirnov fails to register this, which is precisely my earlier point.


          Code:
           mata
          ------------------------------------------------- mata (type end to exit) ------------------------------
          : 
          : P = (
          > 0.3486784401  \    ///
          > 0.3874204890  \    ///
          > 0.1937102445  \    ///
          > 0.0573956280  \    ///
          > 0.0111602610  \    ///
          > 0.0014880348  \    ///
          > 0.0001377810  \    ///
          > 0.0000087480  \    ///
          > 0.0000003645  \    ///
          > 0.0000000090  \    ///
          > 0.0000000001)      
          
          : 
          : throws = 81
          
          : 
          : obs = (27 \ 33 \ 10 \ 5 \ 1 \ 0 \ 0 \ 0 \ 0 \ 0 \ 5) 
          
          : 
          : exp = throws * P 
          
          : 
          : chi = (obs :- exp) :/ sqrt(exp)
          
          : 
          : chisq = sum(chi:^2)
          
          : 
          : chi 
                             1
               +----------------+
             1 |  -.2338836575  |
             2 |   .2889994768  |
             3 |  -1.436593504  |
             4 |   .1627677816  |
             5 |   .1009896474  |
             6 |   -.347175487  |
             7 |  -.1056421365  |
             8 |  -.0266193163  |
             9 |  -.0054336452  |
            10 |   -.000853815  |
            11 |   55555.55547  |
               +----------------+
          
          : 
          : strofreal(chi, "%12.3f")
                          1
               +-------------+
             1 |     -0.234  |
             2 |      0.289  |
             3 |     -1.437  |
             4 |      0.163  |
             5 |      0.101  |
             6 |     -0.347  |
             7 |     -0.106  |
             8 |     -0.027  |
             9 |     -0.005  |
            10 |     -0.001  |
            11 |  55555.555  |
               +-------------+
          
          : 
          : chisq 
            3086419745
          
          : 
          : end
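          The same arithmetic in Python (a cross-check, not part of the thread) confirms that essentially the whole chi-square statistic comes from the k = 10 cell:

          ```python
          from math import sqrt

          # Theoretical binomial(10, 0.1) probabilities and observed
          # frequencies, as in the Mata session above.
          P = [0.3486784401, 0.3874204890, 0.1937102445, 0.0573956280,
               0.0111602610, 0.0014880348, 0.0001377810, 0.0000087480,
               0.0000003645, 0.0000000090, 0.0000000001]
          obs = [27, 33, 10, 5, 1, 0, 0, 0, 0, 0, 5]
          throws = 81

          exp = [throws * p for p in P]
          # Pearson residuals: (observed - expected) / sqrt(expected)
          chi = [(o - e) / sqrt(e) for o, e in zip(obs, exp)]
          chisq = sum(c * c for c in chi)

          # chi[10] ≈ 55555.555 and chisq ≈ 3.09e+09, matching the Mata output.
          ```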



          • #6
            Thank you!



            • #7
              Terms of the form (observed MINUS expected) / sqrt(expected) are now often called Pearson residuals. This is a little generous in that there is no evidence (that I've seen) that (Karl) Pearson used them, but they acknowledge his crucial role in proposing chi-square tests (although he didn't understand his own creation very well, as witness Fisher's corrections about degrees of freedom). Such residuals seem to have grown from use in the 1950s to being mentioned in formal literature from the early 1970s or so.
