Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Stata equivalent to "qbinom" in R?

    Hi All,

    R has a function called qbinom that returns the returns the value of the inverse cumulative density function (cdf) of the binomial distribution given a certain random variable q, number of trials (size) and probability of success on each trial (prob).


    For example,
    Code:
    qbinom(0.005, 2000, 0.02241113)
    returns the value 29. That is, it "backs in" to the number of successes k that are required in Stata's binomial (and invbinomial) functions.

    Does anyone know how to get an equivalent result in Stata?

    (I have figured out binomial() is equal to R's pbinom() and that binomialp() is equal to R's dbinom()).

    Thanks!

    Ariel

  • #2
    There is no closed form for the inverse CDF of the binomial distribution - R uses the Cornish–Fisher Expansion (https://en.wikipedia.org/wiki/Cornis...sher_expansion) to calculate the value of k.

    Here is a naïve approach:
    Code:
    capture program drop qbinom
    program define qbinom, rclass
        syntax , q(numlist) size(numlist)  prob(numlist)
        local k = 0
        local p = 0
        while `p' -  `prob' <= -1e-10 {
            local ++k
            local p = invbinomial(`size', `k', `q')
        }
        disp " k = " `k'
        return scalar k = `k'
    end
    Using Haghish's -rcall-
    Code:
    . qbinom, q(.4) size(30) prob(.25)
     k = 7
    
    . rcall: qbinom(.4, size=30, prob=.25)
    program Rpath already defined
    [1] 7
    
    . qbinom, q(.005) size(2000) prob(.02241113)
     k = 29
    
    . rcall: qbinom(.005, size=2000, prob=.02241113)
    program Rpath already defined
    [1] 29
    Haghish, E. F. (2019). Seamless interactive language interfacing between R and Stata. The Stata Journal, 19(1), 61-82.

    Comment


    • #3
      If you're using a recent version of Stata with its Python integration, there's also a similar function in the SciPy library's stats subpackage that you could avail yourself of. Based on its speed, it seems to also use an approximation to the CDF in order to choose an entry point before beginning its search.

      If you're doing one-offs, then Scott's naive approach is all you need, but if you're contemplating simulation or other repetitive use, it might be faster in the long run to call Python. If the example below, it's about 50-fold the speed of a naive search.
      Code:
      version 18.0
      
      clear *
      
      // seedem
      set seed 421948205
      
      mata:
      mata set matastrict on
      
      real scalar function qbinomial(real scalar q, real scalar n, real scalar pi) {
      
          real scalar k
      
          if (pi < 0.5) {
              k = 0
              do {
                  k++
              } while (invbinomial(n, k, q) -  pi <= -1e-10)
          }
          else {
              k = n + 1
              while (invbinomial(n, k-1, q) - pi >= 1e-10) {
                  k--
              }
          }
      
          return(k)
      }
      end
      
      python:
      from scipy.stats import binom
      from sfi import Scalar, Macro
      
      end
      
      quietly set obs 1000
      quietly generate int mata_icdf = .
      quietly generate int python_icdf = .
      quietly generate double q = .
      quietly generate double pi = .
      
      tempname mata_icdf python_icdf n q pi
      
      timer clear
      #delimit ;
      forvalue i = 1/`=_N' {;
      
          scalar define `n' = runiformint(1000, 2000);
          scalar define `q' = runiform();
          scalar define `pi' = runiform();
      
          timer on 1;
          mata: st_numscalar(st_local("mata_icdf"),
              qbinomial(st_numscalar(st_local("q")),
              st_numscalar(st_local("n")), st_numscalar(st_local("pi"))));
          timer off 1;
      
          timer on 2;
          python: Scalar.setValue(Macro.getLocal('python_icdf'),
              binom.ppf(Scalar.getValue(Macro.getLocal('q')),
              Scalar.getValue(Macro.getLocal('n')),
              Scalar.getValue(Macro.getLocal('pi'))));
          timer off 2;
      
          quietly replace q = `q' in `i';
          quietly replace pi = `pi' in `i';
          quietly replace mata_icdf = `mata_icdf' in `i';
          quietly replace python_icdf = `python_icdf' in `i';
      };
      #delimit cr
      
      assert mata_icdf == python_icdf
      
      timer list
      display in smcl as result %3.0f r(t1) / r(t2)
      
      exit
      Do-file and log file attached if you're interested.
      Attached Files

      Comment


      • #4
        This

        Code:
        version 18
        
        mata :
        
        real scalar cdfbinomial(
            
            real scalar n,
            real scalar x,
            real scalar p
            
            )
        {
            real scalar mu
            real scalar sigma
            real scalar gamma
            real scalar z
            real scalar k
            
            
            mu    = n*p
            sigma = sqrt(n*p*(1-p))
            gamma = ((1-p)-p)/sigma
            
            z = invnormal(x)
            k  = floor(mu + sigma * (z + gamma*(z*z-1)/6) + 0.5)
            k  = min((k,n))
            
            if (binomial(n,k,p) >= x)
                while (binomial(n,k-1,p) > x)
                    k--
            else
                while (binomial(n,k,p) <= x)
                    k++
                
            return(k)
        }
        
        end
        is (supposed to be) a copy of the core functionality of the respective R algorithm.

        I can replicate Joseph's timings using his code in #3.

        Code:
        (output omitted)
        . timer list
           1:      9.15 /     1000 =       0.0091
           2:      0.17 /     1000 =       0.0002
        
        . display in smcl as result %3.0f r(t1) / r(t2)
         54
        Using the normal approximation as an entry point is about 50(+) times faster in Mata, too.
        Last edited by daniel klein; 12 Dec 2023, 04:12.

        Comment


        • #5
          Thank you all for these thoughtful and very helpful responses!

          Ariel

          Comment


          • #6
            Hi Daniel,

            I am not sure how to get a return scalar from this mata code so that I can proceed to do something with the result. I added st_numscalar("k", k) to your code below to convert mata's k to Stata's k. But that does not appear to return the scalar to Stata.

            I am guessing it is an easy fix that I am overlooking.

            Thanks!

            Ariel

            Code:
            version 17
            clear all
            
            mata :
            
            real scalar cdfbinomial(
                
                real scalar n,
                real scalar x,
                real scalar p
                
                )
            {
                real scalar mu
                real scalar sigma
                real scalar gamma
                real scalar z
                real scalar k
                
                
                mu    = n*p
                sigma = sqrt(n*p*(1-p))
                gamma = ((1-p)-p)/sigma
                
                z = invnormal(x)
                k  = floor(mu + sigma * (z + gamma*(z*z-1)/6) + 0.5)
                k  = min((k,n))
                
                if (binomial(n,k,p) >= x)
                    while (binomial(n,k-1,p) > x)
                        k--
                else
                    while (binomial(n,k,p) <= x)
                        k++
            
                return(k)
            
                st_numscalar("k", k)
            
            }
            
            end
            
            mata :cdfbinomial(2000, 0.005, 0.02241113)

            Code:
            . di k
            k not found
            r(111);

            Comment


            • #7
              Originally posted by Ariel Linden View Post
              I added st_numscalar("k", k) to your code below to convert mata's k to Stata's k. But that does not appear to return the scalar to Stata.
              The function returns before it gets there: put the st_numscalar() line before the return().

              If you're going to use the function in that way, you might want to make it a subroutine instead.
              Code:
              void function cdfbinomial(
                  
                  real scalar n,
                  real scalar x,
                  real scalar p
                  
                  )
              {
                  real scalar mu
                  real scalar sigma
                  real scalar gamma
                  real scalar z
                  real scalar k
                  
                  
                  mu    = n*p
                  sigma = sqrt(n*p*(1-p))
                  gamma = ((1-p)-p)/sigma
                  
                  z = invnormal(x)
                  k  = floor(mu + sigma * (z + gamma*(z*z-1)/6) + 0.5)
                  k  = min((k,n))
                  
                  if (binomial(n,k,p) >= x)
                      while (binomial(n,k-1,p) > x)
                          k--
                  else
                      while (binomial(n,k,p) <= x)
                          k++
              
                  st_numscalar("k", k)
              
                  // return(k)
              
              }

              Comment


              • #8
                I would not hardwire the name k into the function. In fact, I would not have the Mata function return a Stata scalar at all. It just complicates matters if you want to re-use the rather generic function in your next project in a different context. Instead, set up the scalar from within Stata, i.e., code

                Code:
                . mata : st_numscalar("k", invcdfbinomial(2000, 0.005, 0.02241113))
                . display k
                29

                Here is the definition of invcdfbinomial(), which is a better name for the function, that adds some basic argument validation:

                Code:
                version 16.1
                
                mata :
                
                real scalar invcdfbinomial(
                    
                    real scalar n,
                    real scalar x,
                    real scalar p
                    
                    )
                {
                    real scalar mu
                    real scalar sigma
                    real scalar gamma
                    real scalar z
                    real scalar k
                    
                    
                    if ( hasmissing((n,x,p)) )
                        return(.)
                    
                    if (n < 0)
                        _error(3300)
                    
                    if ( (x<0)|(x>1) )
                        _error(3300)
                    
                    if ( (p<0)|(p>1) )
                        _error(3300)
                    
                    mu    = n*p
                    sigma = sqrt(n*p*(1-p))
                    gamma = ((1-p)-p)/sigma
                    
                    z = invnormal(x)
                    k  = floor(mu + sigma * (z + gamma*(z*z-1)/6) + 0.5)
                    k  = min((k,n))
                    
                    if (binomial(n,k,p) >= x)
                        while (binomial(n,k-1,p) > x)
                            k--
                    else
                        while (binomial(n,k,p) <= x)
                            k++
                
                    return(k)
                }
                
                end

                Comment


                • #9
                  Thank you again, Joseph and Daniel! Mata is a completely different beast than Stata.

                  Ariel

                  Comment


                  • #10
                    Hi Daniel and Joseph,

                    Is it possible to change the input criteria of the mata function to accept a variable instead of the numeric value for n?

                    Code:
                     
                     real scalar invcdfbinomial(          real scalar n, // change to varname or varlist?     real scalar x,     real scalar p          )

                    Similarly, can the line of code that is used to pull the value back into Stata be changed to reflect that it is a variable and not a value?
                    Code:
                       
                     . mata : st_numscalar("k", invcdfbinomial(varname, 0.005, 0.02241113))
                    Thank you again! Ariel

                    Comment


                    • #11
                      Do you mean changing the scalar argument to a vector (or even to a matrix)? As far as I can tell, that's not straightforward. Although normal() and binomial() support matrix arguments, you would need to modify the optimization part to work with matrices.

                      The obvious but (very) slow workaround is

                      Code:
                      generate result = .
                      
                      levelsof varname , local(levels)
                      
                      foreach level of levels {
                          
                          mata : st_numscalar("k", invcdfbinomial(`level', #, #))
                          
                          replace result = scalar(k) if varname == `level'
                          
                      }

                      Comment


                      • #12
                        Hi Daniel,

                        That is actually the exact same code that I wrote to get around what I was hoping would be a simple change to the mata function you wrote.

                        Okay, if there is no more efficient approach, I'll stick with the foreach loop!

                        Thanks again for your help. You are a mata wizard!

                        Ariel

                        Comment


                        • #13
                          Hi All,

                          The same issue that exists with qbinom() not having a Stata equivalent also exists with R's qpois() function. Stata has invpoissontail() which returns values that are somewhat close to those of qpois() but not close enough. In the documentation for R's qpois(), it states that "qpois uses the Cornish–Fisher Expansion to include a skewness correction to a normal approximation, followed by a search." I am guessing that is why the results between R and Stata do not match.

                          Is there a way to get identical results in Stata?

                          Thanks in advance!

                          Ariel

                          Code:
                           
                           qpois(0.001, 36 * 0.9801219) = 18
                          Code:
                           . di invpoissontail(36 * 0.9801219, 0.001) = 19.518189

                          Comment


                          • #14
                            Adopting the code underlying the corresponding R algorithm, but omitting the more elaborated checks therein, this should work:

                            Code:
                            version 16.1
                            
                            mata :
                            
                            real scalar invcdfpoisson(
                                
                                real scalar m,
                                real scalar x
                                
                                )
                            {
                                real scalar mu
                                real scalar sigma
                                real scalar gamma
                                real scalar z
                                real scalar k
                                
                                
                                if ( hasmissing((m,x)) )
                                    return(.)
                                    
                                if (m < 0)
                                    _error(3300)
                                
                                if ( (x<0)|(x>1) )
                                    _error(3300)
                                
                                if (m == 0)
                                    return(0)
                                
                                if (x == 0)
                                    return(0)
                                
                                if (x == 1)
                                    return(.)
                                
                                mu    = m
                                sigma = sqrt(m)
                                gamma = 1/sigma
                                
                                z = invnormal(x)
                                k  = round(mu + sigma * (z + gamma*(z*z-1)/6) + 0.5)
                                
                                if (poisson(m,k) >= x)
                                    while ( k & (poisson(m,k-1)>x) )
                                        k--
                                else
                                    while (poisson(m,k) <= x)
                                        k++
                            
                                return(k)
                            }
                            
                            end
                            Code:
                            . mata : invcdfpoisson(36 * 0.9801219, 0.001)
                              18
                            I am not so sure about the edge cases, though.
                            Last edited by daniel klein; 22 Jan 2024, 06:13. Reason: corrected link; slight code polish

                            Comment


                            • #15
                              Thank you, again, Daniel! This works perfectly!

                              Comment

                              Working...
                              X