Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Numerical precision issues with Stata's pctile and altdef in IC and SE

    (This is cross-posted here: https://stackoverflow.com/questions/...f-in-ic-and-se)

    Stata's `altdef` formula in `pctile` gives the wrong result for certain certain numbers in IC and SE (this will also affect `xtile` one the bug with `altdef` there is fixed).

    Code:
    clear
    set obs 89750
    gen double x = 7.2439548890446011
    
    pctile fp = x, nq(500) altdef
    pctile double dp = x, nq(500) altdef
    
    assert (x[1] == fp) | mi(fp)
    assert (x[1] == dp) | mi(dp)
    The above assertions should be true, or at least the second one, but both fail. (Note that in Stata/MP, the second assertion goes through; at least that was the case for me in testing). We can see that

    Code:
    . levelsof fp
    7.243954658508301
    
    . levelsof dp
    7.2439548890446 7.243954889044601 7.243954889044602
    This happens because `altdef` takes an average. The formula is:

    Code:
    scalar perc = 100 * 148 / 500
    scalar ith  = (_N + 1) * perc / 100
    scalar i    = floor(ith)
    scalar h    = ith - i
    scalar q    = (1 - h) * x[i] + h * x[i + 1]
    
    assert x[i] == x[i - 1]
    assert q == dp[148]
    assert q == x[i]
    The first two assertions succeeded but the third fails. Stata's `pctile` fails to recognize that `x[i]` is equal to `x[i - 1]`.

    (Note: Naturally my actual use case involved a variable that had different values, but one of them was `7.2439548890446011` and that caused the problem.)
    Stata's altdef formula in pctile gives the wrong result for certain certain numbers in IC and SE (this will also affect xtile one the bug with altdef there is fixed). clear set obs 89750 gen doubl...

  • #2
    I suspect the problem could be corrected by changing (the equivalent in Stata's code of)
    Code:
    scalar q = (1 - h) * x[i] + h * x[i + 1]
    to (the equivalent of)
    Code:
    scalar q = x[i] + h * (x[i + 1] - x[i])
    But realistically, this is not an error, it's just a particularly noticeable instance of the imprecision caused by truncated representation of hexidecimal fractions. The rewritten version captures the noticeable problem when x[i+1]==x[i], that is all.
    Last edited by William Lisowski; 18 Nov 2017, 15:57.

    Comment


    • #3
      I mean, if it gives the wrong answer it is an error, woudn't you say? Specially when rewriting the formula would give the correct answer (as you point out).

      The issue is not how to fix it manually. The issue is if I want to use the built-in pctile, _pctile, or xtile (with xtile in particular, it gives the wrong category, which is how I noticed this in the first place). I think there must be a way to fix it since I don't get the issue in MP, just IC and SE.

      Comment


      • #4
        While there are people from StataCorp who follow this forum regularly, I think this is probably something you should call directly to the attention of technical support. Send them a link to this thread, and also information about your setup(s) (basically the output from the -about- command).

        I am intrigued that you get this result in IC and SE but not MP. It is hard for me to understand how that could happen. Are these all on the same machine?

        Comment


        • #5
          Consider the following.
          Code:
          . scalar x1 = 1.1
          
          . scalar x2 = 4.4
          
          . scalar z1 = 3.3
          
          . scalar z2 = x1 + (2/3)*(x2-x1)
          
          . display x1 " " %21x x1
          1.1 +1.199999999999aX+000
          
          . display x2 " " %21x x2
          4.4 +1.199999999999aX+002
          
          . display z1 " " %21x z1
          3.3 +1.a666666666666X+001
          
          . display z2 " " %21x z2
          3.3 +1.a666666666667X+001
          
          . assert z1==z2
          assertion is false
          Note that this uses the rewritten code, and yet for this calculation fails in the same way that the original calculation fails.

          The problem is not an error, it is an unavoidable consequence of imprecise representation of nonterminating fractional values. Consider now
          Code:
          . scalar x1 = 111/100
          
          . scalar x2 = x1-1
          
          . scalar x3 = 11/100
          
          . display x1 " " %21x x1
          1.11 +1.1c28f5c28f5c3X+000
          
          . display x2 " " %21x x2
          .11 +1.c28f5c28f5c30X-004
          
          . display x3 " " %21x x3
          .11 +1.c28f5c28f5c29X-004
          
          . assert x2==x3
          assertion is false
          For further discussion, see the output of Stata's help precision command, and for even more detail, this linked Stata Blog post.

          Comment


          • #6
            But if these are all being done on the same machine, I don't see why it should make any difference which flavor of Stata is being used: that suggests to me that there is something else going on beyond a routine precision issue.

            Comment


            • #7
              I'm not sure that we know all three were done on the same machine, or same hardware architecture, but I too am moderately curious if the results would differ between Stata flavors on the same machine.

              However, that has no bearing on the general principle that comparisons of two fractional values arrived at through different calculations being fraught with the potential for differences of truly trivial magnitudes, and those differences are not indicative of a software "error".

              Let me note that the original post claims, in the first of the two assert commands, that effectively float(7.2439548890446011) should equal 7.2439548890446011, despite the lesser precision of the float data type, and the demo below shows that the first assert command never came close to being satisfied.
              Code:
              . clear
              
              . set obs 89750
              number of observations (_N) was 0, now 89,750
              
              . gen double x = 7.2439548890446011
              
              . 
              . pctile fp = x, nq(500) altdef
              
              . pctile double dp = x, nq(500) altdef
              
              . 
              . display fp " " %21x fp
              7.2439547 +1.cf9cf40000000X+002
              
              . display dp " " %21x dp
              7.2439549 +1.cf9cf4f78955fX+002

              Comment


              • #8
                The SE and MP discrepancy happens on the same machine, yes. The IC result, however, is only for my laptop, since I don't have another version of Stata locally.

                It is an error because this is not about computing arbitrarily precise operations with fractions. It's about percentiles. And the important property of a percentile is not that it's arbitrarily precise. That's not actually the problem. The important property is that a portion of the data is below that number. Obviously with a finite number of values percentiles will always be approximate, but there is no reason not to be more precise if it is possible. Consider this:


                Code:
                clear
                set obs 8975
                set seed 1729
                gen double x = rnormal() * 5 + 10
                expand 10
                pctile p = x, nq(500) altdef
                xtile tile = x, cutpoints(p)
                
                preserve
                    contract tile
                    tab _freq
                restore
                This gives

                Code:
                  Frequency |      Freq.     Percent        Cum.
                ------------+-----------------------------------
                        170 |        137       27.40       27.40
                        180 |        251       50.20       77.60
                        190 |        112       22.40      100.00
                ------------+-----------------------------------
                      Total |        500      100.00

                But we can do better.

                Code:
                mata
                nq = 500
                N  = `=_N'
                X  = st_data(., "x")
                N  = rows(X)
                _sort(X, 1)
                
                quantiles = ((1::(nq - 1)) * (N + 1) / nq)
                qpos_i    = floor(quantiles)
                qpos_i1   = qpos_i :+ 1
                qdiff     = (quantiles - qpos_i)
                
                zeros = selectindex(qpos_i :== 0)
                if ( rows(zeros) > 0 ) {
                    qpos_i[zeros] = J(rows(zeros), 1, 1)
                }
                
                Ns = selectindex(qpos_i1 :== (N + 1))
                if ( rows(Ns) > 0 ) {
                    qpos_i1[Ns] = J(rows(Ns), 1, N)
                }
                
                X_i  = X[qpos_i]
                X_i1 = X[qpos_i1]
                P    = X_i :+ qdiff :* (X_i1 :- X_i)
                
                st_addvar("double", "pmata")
                st_store((1::(nq - 1)), "pmata", P)
                end
                
                xtile tmata = x, cutpoints(pmata)
                preserve
                    contract tmata
                    tab _freq
                restore
                This gives


                Code:
                  Frequency |      Freq.     Percent        Cum.
                ------------+-----------------------------------
                        170 |         25        5.00        5.00
                        180 |        475       95.00      100.00
                ------------+-----------------------------------
                      Total |        500      100.00
                A marked improvement, wouldn't you agree?

                Comment


                • #9
                  Originally posted by William Lisowski View Post
                  Let me note that the original post claims, in the first of the two assert commands, that effectively float(7.2439548890446011) should equal 7.2439548890446011, despite the lesser precision of the float data type, and the demo below shows that the first assert command never came close to being satisfied.
                  Clearly float(X) need not equal double(X). My intention was merely expository and I did not meant to make that claim implicitly or explicitly. Apologies for the confusion. See my post above for a better example of why this is a problem (I think).

                  Comment


                  • #10
                    The SE and MP discrepancy happens on the same machine
                    An interesting experiment would be to add
                    Code:
                    set processors 1
                    to your test on MP. In one of my earliest Statalist topics I discovered performance issues in xtlogit when compared to melogit on my SE system. Stata Technical Services duplicated the problem on their MP system, but only by forcing Stata to run the non-MP code behind xtlogit using the above technique. I suspect that there may be a similar fork in the code base for the compiled (that is, non-ado) commands at the heart of these commands.

                    If indeed running MP on a "single processor" system recreates the problem, you've got a good reproducible example worth sharing with Technical Services, as Clyde suggested. Perhaps the implementation of the multiprocessor enhancements introduced code improvments that were not carried back to the (original?) code that is executed in single-processor mode.

                    Consider also this variant on your example from post #8, which shows that even Stata SE can do better on your sample data by using xtile directly than by using pctile to create cutoffs and the using xtile to apply those cutoffs to the data. Interesting and not immediately explicable to me, even after a short glance at the xtile source code.
                    Code:
                    clear
                    set obs 8975
                    set seed 1729
                    gen double x = rnormal() * 5 + 10
                    expand 10
                    xtile tile = x, nq(500) altdef
                    
                    contract tile
                    tab _freq
                    Code:
                      Frequency |      Freq.     Percent        Cum.
                    ------------+-----------------------------------
                            170 |         25        5.00        5.00
                            180 |        475       95.00      100.00
                    ------------+-----------------------------------
                          Total |        500      100.00

                    Comment


                    • #11
                      I just realized that in my post above, the results are comparing a double and a float. The correct comparison is:

                      Code:
                      clear
                      set obs 8975
                      set seed 1729
                      gen double x = rnormal() * 5 + 10
                      expand 10
                      pctile double p = x, nq(500) altdef
                      xtile tile = x, cutpoints(p)
                      
                      preserve
                          contract tile
                          tab _freq
                      restore
                      which gives

                      Code:
                        Frequency |      Freq.     Percent        Cum.
                      ------------+-----------------------------------
                              170 |         33        6.60        6.60
                              180 |        459       91.80       98.40
                              190 |          8        1.60      100.00
                      ------------+-----------------------------------
                            Total |        500      100.00
                      Which is better but still off. I will also point out that without `altdef`, you get a similar distribution with xtile vs using the corrected formula.

                      Code:
                      pctile double p2 = x, nq(500)
                      xtile tile2 = x, cutpoints(p2)
                      
                      preserve
                          contract tile2
                          tab _freq
                      restore
                      This also gives

                      Code:
                        Frequency |      Freq.     Percent        Cum.
                      ------------+-----------------------------------
                              170 |         25        5.00        5.00
                              180 |        475       95.00      100.00
                      ------------+-----------------------------------
                            Total |        500      100.00
                      But, as expected, tile2 and tmata are different. (I also tested this in the machine where I have Stata SE and MP; the issue is, again, not present in MP once I correctly create "p" as a double).

                      Comment


                      • #12
                        Originally posted by William Lisowski View Post
                        Consider also this variant on your example from post #8, which shows that even Stata SE can do better on your sample data by using xtile directly than by using pctile to create cutoffs and the using xtile to apply those cutoffs to the data. Interesting and not immediately explicable to me, even after a short glance at the xtile source code.
                        This may not correct. Stata's built-in xtile ignores "altdef", which is why I did not use it in my example. I have reported this bug here: https://www.statalist.org/forums/for...-option-altdef

                        So this is comparing xtile without altdef, which is better than xtile with altdef (once it is fixed). Also,

                        Originally posted by William Lisowski View Post
                        An interesting experiment would be to add
                        Code:
                        set processors 1
                        I did do this, but the results were the same.

                        Comment


                        • #13
                          I would suggest that the problem could be avoided by not working at the limits of Stata's numerical precision, but instead converting the values you are working with from double to float. That is essentially the point made in the discussion of "false precision" made in the Stata Blog post I linked to at #5 above.

                          Comment


                          • #14
                            I'll keep that in mind. I am prone to working with doubles, so stuff like this comes up... As I note a few posts back, pctile wasn't actually the issue, but stuff like that eventually trickles down to xtile and if conditions and assert statements. I don't mind pctile being 1e-5 off, or whatever it actually is, but I mind xtile giving different inconsistent results with identical inputs.

                            My original posts did not reflect this, so I think we had some back and forth about different things because I was unclear why this was actually a problem. Apologies for that.

                            Comment

                            Working...
                            X