I was looking to understand how Stata computes weighted percentiles. I tried to follow the formula in the documentation (here: https://www.stata.com/manuals13/dpctile.pdf) but I get different results:
I'm not entirely sure why it gives that answer. It seems incorrect:
This shows that the first x such that the cumulative weighted sum is is larger than P is 997; however, Stata seems to think that this requires averaging with the prior observation, which should only happen if 49850.32031 is equal to 49850.55859.
I thought this might be a numerical precision issue, but I get the same result using doubles (the cum sum is 49850.31956 and P is 49850.56176). It is interesting to note that collapse gives what I understand to be the right answer:
Any insights?
Code:
. set seed 1729 . clear . set obs 100000 number of observations (_N) was 0, now 100,000 . gen w = 100 * runiform() . gen long x = _n . sort x w . gen wsum = sum(w) . gen P = wsum[_N] * 1 / 100 . assert abs(wsum - P) > 0.1 . . gen gt = (wsum > P) . qui sum x if gt . disp `r(min)' 997 . . _pctile x [aw = w], p(1) . return list scalars: r(r1) = 996.5
Code:
. format %21.5f wsum P . disp P 49850.559 . qui count if wsum < P . list in `=r(N) - 1' / `=r(N) + 1' +-------------------------------------------------+ | w x wsum P gt | |-------------------------------------------------| 995. | 22.8744 995 49817.42578 49850.55859 0 | 996. | 32.89221 996 49850.32031 49850.55859 0 | 997. | 25.59445 997 49875.91406 49850.55859 1 | +-------------------------------------------------+
I thought this might be a numerical precision issue, but I get the same result using doubles (the cum sum is 49850.31956 and P is 49850.56176). It is interesting to note that collapse gives what I understand to be the right answer:
Code:
. collapse (p1) x [aw = w] . disp x 997