
  • Counting the fraction of (X, Z) pairs with the property that X > Z

    Consider two variables, X and Z (which may have different numbers of non-missing observations). I am trying to compute the fraction of all possible (X, Z) pairs for which X > Z. (Actually, I am trying to do something a bit more complicated, but this should be a good warm-up!)

    For example, suppose that my dataset is:
    X Z
    1 0
    . 2
    (Here, X has one missing observation.) In this case, there are two possible pairs, i.e. (1, 0) and (1, 2), and X > Z in 1/2 of the cases.

    In Python, one could do this by writing something like:

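    Code:
    x_list = [1]     # non-missing X values from the example above
    z_list = [0, 2]  # non-missing Z values from the example above
    pairs = 0
    x_exceeds_z = 0
    for x in x_list:
        for z in z_list:
            pairs += 1
            if x > z:
                x_exceeds_z += 1
    print(x_exceeds_z / pairs)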
    However, I have no idea how to do this in Stata. Is it easy to do?

    If I may ask a second question: I will ultimately want to bootstrap (a more complicated version of) this estimate. Is this also easy to do in Stata?

    Thanks in advance for any suggestions or pointers.
    Last edited by Itzhak Rasooly; 01 Feb 2024, 12:27.

  • #2
    Code:
    count if X > Z & !missing(X, Z)
    local numerator `r(N)'
    count
    local denominator `r(N)'
    display as text "Fraction of pairs with X > Z = " as result  `=`numerator'/`denominator''
    Added: Stata is really not like Python. It requires a different way of thinking about data and its organization. Its primitives are higher-order objects than those of Python. I suggest you take some time to get an overview of Stata's approach. Your Stata installation comes with PDF user manuals, which you can find in Stata's Help menu. Read the User's Guide [U] and Getting Started [GS] manuals to get a sense of how things work in Stata.
    Last edited by Clyde Schechter; 01 Feb 2024, 12:38.


    • #3
      Hi Clyde, many thanks for the suggestion! However, I think your code may only compare (X, Z) values in the same row? Indeed, when I checked X = (1, 2, 3), Z = (2, 3, 4), your code seemed to give the answer of 0, which is not correct. (To clarify, I want to consider all pairs X_i, Z_j where i and j can take all feasible values.)
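
To make the distinction in #3 concrete, here is a minimal Python sketch in the spirit of the loop in #1. The function names are illustrative and not from the thread. It contrasts a same-row comparison with a comparison over all (X_i, Z_j) pairs for X = (1, 2, 3), Z = (2, 3, 4).

```python
def rowwise_fraction(xs, zs):
    """Fraction of rows i with xs[i] > zs[i] (a same-row comparison)."""
    rows = list(zip(xs, zs))
    return sum(x > z for x, z in rows) / len(rows)


def all_pairs_fraction(xs, zs):
    """Fraction of all (x_i, z_j) pairs with x_i > z_j."""
    pairs = [(x, z) for x in xs for z in zs]
    return sum(x > z for x, z in pairs) / len(pairs)


xs = [1, 2, 3]
zs = [2, 3, 4]
print(rowwise_fraction(xs, zs))    # 0.0: no row has x > z
print(all_pairs_fraction(xs, zs))  # 1/9: only the pair (3, 2) qualifies
```

The same-row version returns 0 because no X_i exceeds the Z_i beside it, while one of the nine cross pairs does satisfy X > Z.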


      • #4
        "I want to consider all pairs X_i, Z_j where i and j can take all feasible values"
        I did not understand your original question. I thought you wanted to only consider X_i with Z_i.

        So it's a little more complicated:
        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input byte(x z)
        1 0
        . 2
        end
        
        preserve
        keep z
        tempfile zs
        save `zs'
        restore
        drop z
        
        cross using `zs'
        count if x > z & !missing(x, z)
        local numerator `r(N)'
        count if !missing(x, z)   // denominator: only pairs with both values present
        local denominator `r(N)'
        display as text "Fraction of pairs with X > Z = " as result  `=`numerator'/`denominator''
        If x and z, when not missing, are always integers, then there is a nicer way. Post back if that's the case.
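
As a cross-check on the pairing logic, the following Python sketch forms every combination of X and Z values and counts those with X > Z. It assumes that `None` stands in for a Stata missing value and that pairs involving a missing value are excluded from both the numerator and the denominator, as in the example in #1, where only (1, 0) and (1, 2) count as pairs. The function name is hypothetical.

```python
from itertools import product


def fraction_x_gt_z(xs, zs):
    """Cross all non-missing x with all non-missing z; None marks a missing value."""
    xs_ok = [x for x in xs if x is not None]
    zs_ok = [z for z in zs if z is not None]
    pairs = list(product(xs_ok, zs_ok))  # every (x_i, z_j) combination
    if not pairs:
        raise ValueError("no complete (x, z) pairs")
    return sum(x > z for x, z in pairs) / len(pairs)


# Data from #1: X = (1, missing), Z = (0, 2)
print(fraction_x_gt_z([1, None], [0, 2]))  # 0.5
```

Like any cross-join approach, this materializes all n_x * n_z pairs, so it is only practical for small data.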


        • #5
          On reflection, the approach in #4, while appropriate for a small data set, will be far too demanding of both memory and computation time in a large data set. It's a brute force counting method.

          Here's something a bit more efficient:
          Code:
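          * Denominator: (non-missing x values) * (non-missing z values)
          count if !missing(x)
          local denominator = r(N)
          count if !missing(z)
          local denominator = `denominator' * r(N)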
          rename (x z) (x0 x1)
          gen `c(obs_t)' obs_no = _n
          reshape long x, i(obs_no)
          drop if missing(x)
          rangestat (sum) dominated_zs = _j, interval(x . x)
          rangestat (sum) equal_zs = _j, interval(x 0 0)
          replace dominated_zs = dominated_zs - equal_zs
          summ dominated_zs if _j == 0, meanonly
          local numerator `r(sum)'
          display as text "Fraction of pairs with X > Z = " as result  `=`numerator'/`denominator''
          -rangestat- is written by Robert Picard, Nick Cox, and Roberto Ferrer. It is available from SSC.

          This approach is quite similar to what I had in mind for use if x and z are always integers, and there would be no noticeable incremental benefit to the further modifications that could be made in that case.

          This approach destroys the data initially in memory. So if you need to retain the original data, -preserve- it before doing this, and -restore- it at the end.
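
The idea of counting, for each X value, how many Z values lie strictly below it (rather than enumerating every pair) can be sketched in Python as follows. This is an analogue under stated assumptions, not a translation of the -rangestat- code: `None` stands in for a missing value, the function name is hypothetical, and a sorted list with binary search replaces the interval sums.

```python
from bisect import bisect_left


def fraction_x_gt_z_sorted(xs, zs):
    """Fraction of (x, z) pairs with x > z, without building all pairs."""
    xs_ok = [x for x in xs if x is not None]
    zs_sorted = sorted(z for z in zs if z is not None)
    if not xs_ok or not zs_sorted:
        raise ValueError("no complete (x, z) pairs")
    # bisect_left gives the number of z values strictly less than x,
    # so ties (x == z) are not counted as x > z.
    dominated = sum(bisect_left(zs_sorted, x) for x in xs_ok)
    return dominated / (len(xs_ok) * len(zs_sorted))


print(fraction_x_gt_z_sorted([1, None], [0, 2]))      # 0.5
print(fraction_x_gt_z_sorted([1.5, 2.5], [1.5, 0.5]))  # 0.75
```

Sorting Z once and doing one binary search per X value costs on the order of (n_x + n_z) log n_z operations instead of n_x * n_z, and it works for non-integer values as well.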


          • #6
            If I understand what Itzhak Rasooly is asking for, I think I would use -fillin-.

            Code:
            . * Read in the data from #1
            . clear
            
            . input byte (x z)
            
                        x         z
              1. 1 0
              2. . 2
              3. end
            
            . * Generate all x-z pairs
            . fillin x z // generate all x-z pairs
            
            . * Flag the pairs where x > z
            . generate byte XgtZ = x > z if !missing(x, z)
            (2 missing values generated)
            
            . quietly summarize XgtZ // mean = p(x > z)
            
            . generate pXgtZ = r(mean)
            
            . list, clean noobs
            
                x   z   _fillin   XgtZ   pXgtZ  
                1   0         0      1      .5  
                1   2         1      0      .5  
                .   0         1      .      .5  
                .   2         0      .      .5  
            
            . drop if _fillin // If you want to revert to the original dataset
            (2 observations deleted)
            
            .
            . * Read in the data from #3
            . clear
            
            . input byte (x z)
            
                        x         z
              1. 1 2
              2. 2 3
              3. 3 4
              4. end
            
            . * Generate all x-z pairs
            . fillin x z // generate all x-z pairs
            
            . * Flag the pairs where x > z
            . generate byte XgtZ = x > z if !missing(x, z)
            
            . quietly summarize XgtZ // mean = p(x > z)
            
            . generate pXgtZ = r(mean)
            
            . list, clean noobs
            
                x   z   _fillin   XgtZ      pXgtZ  
                1   2         0      0   .1111111  
                1   3         1      0   .1111111  
                1   4         1      0   .1111111  
                2   2         1      0   .1111111  
                2   3         0      0   .1111111  
                2   4         1      0   .1111111  
                3   2         1      1   .1111111  
                3   3         1      0   .1111111  
                3   4         0      0   .1111111  
            
            . drop if _fillin // If you want to revert to the original dataset
            (6 observations deleted)


            --
            Bruce Weaver
            Version: Stata/MP 18.5 (Windows)


            • #7
              Dear both, thanks so much for the code! Clyde Schechter: unfortunately, X and Z need not be integers.
