
  • How do I define a custom distribution to use in a one-sample ksmirnov test?

    Hello,

    I want to use the one-sample Kolmogorov-Smirnov test to check whether the distribution of my sample data corresponds to a theoretical distribution. My theoretical distribution is posted below. In case it is useful, here is why I need this particular theoretical distribution: it comes from a game in which a participant throws 10 unfair coins, each with a 10% chance of landing on heads, and k counts the total number of heads in a series of 10 throws. I calculated the probabilities as Bernoulli trials (i.e., from the binomial distribution).

    I want to see whether my collected data matches the theoretical distribution presented below. One-sample ksmirnov is probably the appropriate test for this. It is clear to me how to perform this test to compare the distribution of my sample with, for example, a Student's t distribution, e.g. -ksmirnov k = t(5,v1)-. However, I don't know how to compare the distribution of my sample to the custom distribution presented below. Any help would be appreciated.

    k Probability of occurrence
    0 0.3486784401
    1 0.3874204890
    2 0.1937102445
    3 0.0573956280
    4 0.0111602610
    5 0.0014880348
    6 0.0001377810
    7 0.0000087480
    8 0.0000003645
    9 0.0000000090
    10 0.0000000001
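    As a side note (not from the thread itself), the tabulated probabilities are exactly the binomial pmf with n = 10 and p = 0.1, which can be verified directly; a quick cross-check in Python:

    ```python
    from math import comb

    # Binomial pmf for k heads in 10 flips with P(heads) = 0.1,
    # which is what the table above tabulates.
    p = [comb(10, k) * 0.1**k * 0.9**(10 - k) for k in range(11)]

    # p[0] = 0.3486784401, p[1] = 0.3874204890, p[2] = 0.1937102445, ...
    # matching the table to all 10 printed decimal places.
    ```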

  • #2
    -help ksmirnov- indicates that you can use an expression to specify the expected theoretical distribution for a one-sample test, which might be possible, since each coin flip is Bernoulli with some known (?) varying parameter.

    Failing that: -search ksmirnov- reveals some user-written software, of which the most relevant appears to be -mgof- (see -ssc describe mgof-). I have not used this program before, but it *appears* to me that the code below does what you want, using a feature of -mgof- that allows you to specify a matrix containing the observed and expected frequencies. I am mildly uncertain about some features of its syntax, so I would strongly encourage you to find a known worked example of the one-sample KS test and compare results against something like the code below before trusting that what I'm suggesting is right.

    A good feature of -mgof-, by the way, is that it readily uses exact or Monte Carlo methods to obtain a p-value. Given your tiny sample size of 10 observations (right? 10 observations of the count of heads?), the asymptotic KS or chi-squared test will not perform well. Here's what I think will work:
    Code:
    clear
    // Theoretical point probabilities, for 0/10 heads, put in a column vector for use by -mgof-.
    mat P = (          ///
    0.3486784401  \    ///
    0.3874204890  \    ///
    0.1937102445  \    ///
    0.0573956280  \    ///
    0.0111602610  \    ///
    0.0014880348  \    ///
    0.0001377810  \    ///
    0.0000087480  \    ///
    0.0000003645  \    ///
    0.0000000090  \    ///
    0.0000000001  )
    //
    // Simulate something like your observed data, which you did not happen to provide.
    set seed 4755
    local throws = 10
    set obs `throws'
    local pheads = 0.1
    gen byte heads =  rbinomial(10, `pheads')
    // Observed counts of heads, obtained
    // so as to include counts of 0
    mat O = J(11,1, 0)
    forval i = 0/10 {
      count if heads == `i'
      mat O[`i'+1, 1] = r(N)
    }
    mat list O
    //
    // Observed/expected matrix for -mgof-
    mat E = `throws' * P //expected
    mat OE = O, E
    mat list OE
    //
    mgof, matrix(OE) ksmirnov mc
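    Not from the thread: the simulation-and-counting step above can be sketched in Python just to make the logic explicit (the seed is arbitrary and will not reproduce Stata's rbinomial() stream).

    ```python
    import random

    random.seed(4755)  # arbitrary; does NOT match Stata's random stream

    # Each observation is one game of 10 unfair flips with P(heads) = 0.1,
    # mirroring -gen byte heads = rbinomial(10, 0.1)- above.
    heads = [sum(random.random() < 0.1 for _ in range(10)) for _ in range(10)]

    # Observed counts of 0..10 heads, keeping zero cells
    # (the role of the J(11,1,0) initialization in the Stata code).
    O = [heads.count(k) for k in range(11)]
    ```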



    • #3
      Thank you so much. I tried to use ksmirnov with an expression to specify the expected theoretical distribution for a one-sample test, but I could not figure out the syntax to achieve this. An example of how to specify the expected theoretical distribution would solve my problem, I think.

      I tried using the code but I don't think I could get it to work.

      Code:
      clear
      // Theoretical point probabilities, for 0/10 heads, put in a column vector for use by -mgof-.
      mat P = (          ///
      0.3486784401  \    ///
      0.3874204890  \    ///
      0.1937102445  \    ///
      0.0573956280  \    ///
      0.0111602610  \    ///
      0.0014880348  \    ///
      0.0001377810  \    ///
      0.0000087480  \    ///
      0.0000003645  \    ///
      0.0000000090  \    ///
      0.0000000001  )
      
      local throws = 81
      set obs `throws'
      
      // Example of frequencies observed in the data set
      mat O = (  ///
      27 \    ///
      33 \    ///
      10 \    ///
       5 \    ///
       1 \    ///
       0 \    ///
       0 \    ///
       0 \    ///
       0 \    ///
       0 \    ///
       5 )
      //
      mat list O
      //
      // Observed/expected matrix for -mgof-
      mat E = `throws' * P //expected
      mat OE = O, E
      mat list OE
      //
      mgof, matrix(OE) ksmirnov mc
      Gives this output:
      Code:
                                                     Number of obs =      81
                                                     N of outcomes =      11
                                                     Replications  =   10000
      
      ----------------------------------------------------------------------
                            |                  Exact                        
            Goodness-of-fit |       Coef.    P-value    [99% Conf. Interval]
      ----------------------+-----------------------------------------------
               Pearson's X2 |    3.09e+09     0.0000      0.0000      0.0005
       Log likelihood ratio |    195.2182     0.0000      0.0000      0.0005
       Kolmogorov-Smirnov D |    .0656116     0.3632      0.3508      0.3757
      ----------------------------------------------------------------------
      I doubt this is correct, because it gives a nonsignificant p-value for the KS test even though, in this hypothetical example, an event with a 10^-10 probability of happening by chance occurs 5 times. Thank you very much for the help so far.
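      For what it's worth, the reported D can be reproduced by hand (a Python cross-check, not part of the thread): the one-sample KS statistic here is the largest absolute gap between the empirical and theoretical CDFs. That maximum occurs at k = 2, not in the tail, because the gap opened by the 5 wildly improbable k = 10 observations closes at the last support point, where both CDFs reach 1.

      ```python
      # Theoretical binomial(10, 0.1) probabilities as tabulated earlier,
      # and the observed frequencies from this post.
      P = [0.3486784401, 0.3874204890, 0.1937102445, 0.0573956280,
           0.0111602610, 0.0014880348, 0.0001377810, 0.0000087480,
           0.0000003645, 0.0000000090, 0.0000000001]
      O = [27, 33, 10, 5, 1, 0, 0, 0, 0, 0, 5]
      n = sum(O)  # 81

      # Cumulative distributions and the KS statistic D = max CDF gap
      cumP, cumO, acc_p, acc_o = [], [], 0.0, 0
      for p, o in zip(P, O):
          acc_p += p
          acc_o += o
          cumP.append(acc_p)
          cumO.append(acc_o / n)

      D = max(abs(fo - fp) for fo, fp in zip(cumO, cumP))
      # D ≈ 0.0656116, matching the mgof output above; the max is at k = 2.
      ```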



      • #4
        Kolmogorov-Smirnov is beautiful mathematics but I think oversold. Necessarily it is more sensitive in the middle of a distribution than in the tails, precisely the opposite of what is most needed in practice. I'd fall back on chi-square -- even though it doesn't use all the information in the data -- mostly because just as valuable as some flavour of P-value are residuals that you can use to see where the fit is best or worst.

        chitesti from tab_chi on SSC goes back to 2004 but does not appear to have been superseded.



        • #5
          That said, here is Mata used as a calculator. The chi-square statistic is massive because of the discrepancy in the last bin; Kolmogorov-Smirnov fails to register this, which is precisely my earlier point.


          Code:
           mata
          ------------------------------------------------- mata (type end to exit) ------------------------------
          : 
          : P = (
          > 0.3486784401  \    ///
          > 0.3874204890  \    ///
          > 0.1937102445  \    ///
          > 0.0573956280  \    ///
          > 0.0111602610  \    ///
          > 0.0014880348  \    ///
          > 0.0001377810  \    ///
          > 0.0000087480  \    ///
          > 0.0000003645  \    ///
          > 0.0000000090  \    ///
          > 0.0000000001)      
          
          : 
          : throws = 81
          
          : 
          : obs = (27 \ 33 \ 10 \ 5 \ 1 \ 0 \ 0 \ 0 \ 0 \ 0 \ 5) 
          
          : 
          : exp = throws * P 
          
          : 
          : chi = (obs :- exp) :/ sqrt(exp)
          
          : 
          : chisq = sum(chi:^2)
          
          : 
          : chi 
                             1
               +----------------+
             1 |  -.2338836575  |
             2 |   .2889994768  |
             3 |  -1.436593504  |
             4 |   .1627677816  |
             5 |   .1009896474  |
             6 |   -.347175487  |
             7 |  -.1056421365  |
             8 |  -.0266193163  |
             9 |  -.0054336452  |
            10 |   -.000853815  |
            11 |   55555.55547  |
               +----------------+
          
          : 
          : strofreal(chi, "%12.3f")
                          1
               +-------------+
             1 |     -0.234  |
             2 |      0.289  |
             3 |     -1.437  |
             4 |      0.163  |
             5 |      0.101  |
             6 |     -0.347  |
             7 |     -0.106  |
             8 |     -0.027  |
             9 |     -0.005  |
            10 |     -0.001  |
            11 |  55555.555  |
               +-------------+
          
          : 
          : chisq 
            3086419745
          
          : 
          : end
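          The same arithmetic in Python (a cross-check, not part of the thread) confirms that essentially the whole chi-square statistic comes from the k = 10 cell:

          ```python
          from math import sqrt

          # Theoretical binomial(10, 0.1) probabilities and observed
          # frequencies, as in the Mata session above.
          P = [0.3486784401, 0.3874204890, 0.1937102445, 0.0573956280,
               0.0111602610, 0.0014880348, 0.0001377810, 0.0000087480,
               0.0000003645, 0.0000000090, 0.0000000001]
          obs = [27, 33, 10, 5, 1, 0, 0, 0, 0, 0, 5]
          throws = 81

          exp = [throws * p for p in P]
          # Pearson residuals: (observed - expected) / sqrt(expected)
          chi = [(o - e) / sqrt(e) for o, e in zip(obs, exp)]
          chisq = sum(c * c for c in chi)

          # chi[10] ≈ 55555.555 and chisq ≈ 3.09e+09, matching the Mata output.
          ```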



          • #6
            Thank you!



            • #7
              Terms of the form (observed MINUS expected) / sqrt(expected) are now often called Pearson residuals. This is a little generous in that there is no evidence (that I've seen) that (Karl) Pearson used them, but they acknowledge his crucial role in proposing chi-square tests (although he didn't understand his own creation very well, as witness Fisher's corrections about degrees of freedom). Such residuals seem to have grown from use in the 1950s to being mentioned in formal literature from the early 1970s or so.
