Why is Stata calculating percentiles in the way it does, and who has said that this is the way to calculate percentiles?

Joro Kolev

Join Date: Aug 2018

Posts: 3047
#1

08 Mar 2021, 05:28

The definition of a percentile that I was taught is

(1) Q(p) = F^(-1)(p) = inf{x: F(x) >= p}, 0 < p < 1,

where F(x) is the cumulative distribution function and F^(-1)(p) is the inverse cumulative distribution function. See, e.g., the very first definition in Rob J. Hyndman & Yanan Fan (1996), "Sample Quantiles in Statistical Packages", The American Statistician, 50(4), 361-365.

The manual entry for -xtile- (Methods and Formulas, p. 584) describes an algorithm but does not give a reference to a statistics textbook or article that derives this algorithm or explains why it makes sense.

And the Stata algorithm does not agree (as far as I can see) with the definition above from Hyndman & Fan (1996, p.361).

Take this example here:

Code:

. sysuse auto, clear
(1978 Automobile Data)

. keep price

. keep in 1/20
(54 observations deleted)

. sort price

. cumul price, gen(cumprice)

. list, sep(5)

     +-------------------+
     |  price   cumprice |
     |-------------------|
  1. |  3,299        .05 |
  2. |  3,667         .1 |
  3. |  3,799        .15 |
  4. |  3,955         .2 |
  5. |  3,984        .25 |
     |-------------------|
  6. |  4,082         .3 |
  7. |  4,099        .35 |
  8. |  4,453         .4 |
  9. |  4,504        .45 |
 10. |  4,749         .5 |
     |-------------------|
 11. |  4,816        .55 |
 12. |  5,104         .6 |
 13. |  5,189        .65 |
 14. |  5,705         .7 |
 15. |  5,788        .75 |
     |-------------------|
 16. |  7,827         .8 |
 17. | 10,372        .85 |
 18. | 11,385         .9 |
 19. | 14,500        .95 |
 20. | 15,906          1 |
     +-------------------+

As far as the definition in eq.(1) is concerned, the 25th percentile here is 3,984, the 50th is 4,749, and the 75th is 5,788. This is not what -_pctile- returns:

Code:

. _pctile price, perc(25 50 75)

. return list

scalars:
                 r(r1) =  4033
                 r(r2) =  4782.5
                 r(r3) =  6807.5

What -_pctile- has done according to Stata's definition on p. 584 (Methods and Formulas -xtile-) is the following:

Code:

. dis (3984+4082)/2
4033

. dis (4749+4816)/2
4782.5

. dis (5788+7827)/2
6807.5
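To put the two rules side by side, here is a minimal sketch. It assumes my reading of the manual is right: that Stata averages x[k] and x[k+1] whenever N*p/100 is a whole number k, and otherwise takes the order statistic at rank ceil(N*p/100). The local names P, k, qA and qB are my own and do not come from the manual.

```stata
sysuse auto, clear
keep price
keep in 1/20
sort price

foreach p of numlist 25 50 75 {
    * rank position N*p/100 and the smallest rank whose ECDF value reaches p/100
    local P = _N*`p'/100
    local k = ceil(`P')

    * Rule A, eq.(1): the order statistic at rank k
    local qA = price[`k']

    * Rule B (assumed reading of the manual): average x[k] and x[k+1]
    * when N*p/100 is a whole number, otherwise keep x[k]
    local qB = `qA'
    if `P' == floor(`P') & `k' < _N {
        local qB = (price[`k'] + price[`k'+1])/2
    }

    display "p = `p'   eq.(1): `qA'   averaging: `qB'"
}
```

On this sample, rule A gives 3984, 4749 and 5788, while rule B gives 4033, 4782.5 and 6807.5, which matches the -_pctile- results above. For percentiles where N*p/100 is not exactly representable in floating point, the whole-number test would need a tolerance.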

My question is why, and who has said that the algorithm that Stata is implementing is the way to calculate percentiles?
Daniel Feenberg

Join Date: Oct 2014

Posts: 323
#2

08 Mar 2021, 06:39

The last sentence of the paper cited suggests:

We believe there is a similar need to adopt a standard sample quantile definition, and we propose that Qs(p) is the best choice.

which appears to be the choice Stata has made. It may have been chosen because it makes the 50th percentile equal to customary definitions of the median, but "why" questions don't always have an answer. If you prefer a different definition, quantiles are fairly easy to calculate in Stata. For example:

Code:

sort x
gen ptile = int(100*(_n-1)/_N)+1

avoids the averaging of adjacent values.
Joro Kolev

Join Date: Aug 2018

Posts: 3047
#3

08 Mar 2021, 07:29

There are two separate issues here.
1) One is how we define a quantile, and as far as I can see Stata's definition does not agree with eq.(1), which is the definition I am familiar with.
2) Two is what ad hoc interpolations we do when no number in the sample fits our definition.

The data sample I showed above has the following features: it is of N=20 points, and the variable price has all distinct values.

In this situation there are unique values in the data that delineate halves (median = 4,749), quartiles, quintiles, deciles and ventiles. For all of these, the data contain unique values that fit definition eq.(1), so we have the values already, without any fudging. However, Stata gives different values. This is the issue I have in mind.

The second and separate issue is what fudging we do when there is no unique value satisfying the definition eq.(1).

My question is about 1): the case when there are unique values in the sample fitting our definition.

Last edited by Joro Kolev; 08 Mar 2021, 07:33.
Joao Santos Silva

Join Date: Apr 2014

Posts: 3000
#4

08 Mar 2021, 07:37

See also the discussion here.
Daniel Feenberg

Join Date: Oct 2014

Posts: 323
#5

08 Mar 2021, 07:40

Now your point is much clearer. Stata's choice is a puzzle.

Leonardo Guizzetti

Join Date: Jul 2016
Posts: 2389

08 Mar 2021, 12:56

Just an observation: SAS does the same thing by default, in that percentiles are always calculated from the empirical CDF with averaging (regardless of whether interpolation is strictly needed). The default method is #5, whose output is:

Code:

     +------------------------+
     | pctl   price     defn5 |
     |------------------------|
  1. |    5    3299      3483 |
  2. |   10    3667      3733 |
  3. |   15    3799      3877 |
  4. |   20    3955    3969.5 |
  5. |   25    3984      4033 |
     |------------------------|
  6. |   30    4082    4090.5 |
  7. |   35    4099      4276 |
  8. |   40    4453    4478.5 |
  9. |   45    4504    4626.5 |
 10. |   50    4749    4782.5 |
     |------------------------|
 11. |   55    4816      4960 |
 12. |   60    5104    5146.5 |
 13. |   65    5189      5447 |
 14. |   70    5705    5746.5 |
 15. |   75    5788    6807.5 |
     |------------------------|
 16. |   80    7827    9099.5 |
 17. |   85   10372   10878.5 |
 18. |   90   11385   12942.5 |
 19. |   95   14500     15203 |
 20. |  100   15906     15906 |
     +------------------------+

No justification is given for choosing this method as the default either. In this respect, Stata is consistent with the other major software defaults.


Joro Kolev

Join Date: Aug 2018

Posts: 3047
#7

09 Mar 2021, 04:51

Thank you, Leonardo, for this very useful information.

So apparently this is an industry choice, and as Daniel Feenberg hinted in #2, the industry probably does it so that the median of a set such as {1,2,3,4} comes out not as 2 but as the average of 2 and 3.
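The {1,2,3,4} case is easy to check directly. A minimal sketch, where the variable name x is my own and the second display applies eq.(1) by hand as the order statistic at rank ceil(N/2):

```stata
clear
set obs 4
generate x = _n
sort x

* Stata's default rule for the 50th percentile
_pctile x, percentiles(50)
display "default _pctile median: " r(r1)

* eq.(1): smallest x with F(x) >= 0.5, i.e. the order statistic at rank ceil(N*0.5)
display "eq.(1) median: " x[ceil(_N*0.5)]
```

The first display should show 2.5 and the second 2, which is exactly the difference between the two definitions when N*p is a whole number.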


FernandoRios

Join Date: Apr 2014
Posts: 2430

09 Mar 2021, 05:22

To add to the conversation: I think another reason interpolation is needed is that the "percentile" formula being used internally is different:

Code:

sysuse auto, clear
keep price
keep in 1/20
sort price
gen per=(2*_n-1)/(2*_N)
list, sep(0)
_pctile price, p(27.5)


. list, sep(0)

     +---------------+
     |  price    per |
     |---------------|
  1. |  3,299   .025 |
  2. |  3,667   .075 |
  3. |  3,799   .125 |
  4. |  3,955   .175 |
  5. |  3,984   .225 |
  6. |  4,082   .275 |
  7. |  4,099   .325 |
  8. |  4,453   .375 |
  9. |  4,504   .425 |
 10. |  4,749   .475 |
 11. |  4,816   .525 |
 12. |  5,104   .575 |
 13. |  5,189   .625 |
 14. |  5,705   .675 |
 15. |  5,788   .725 |
 16. |  7,827   .775 |
 17. | 10,372   .825 |
 18. | 11,385   .875 |
 19. | 14,500   .925 |
 20. | 15,906   .975 |
     +---------------+
. _pctile price, p(27.5)
. return list
scalars:
                 r(r1) =  4082
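The same check can be run at every midpoint, not just 27.5. This is only a sketch under the assumption that -_pctile- returns the i-th sorted value whenever the requested percentile is 100*(2i-1)/(2N). The loop and the local name p are mine. If that assumption is wrong for some i, the assert will stop with an error.

```stata
sysuse auto, clear
keep price
keep in 1/20
sort price

* ask for the percentile at each midpoint (2i-1)/(2N), expressed in percent
forvalues i = 1/`=_N' {
    local p = 100*(2*`i' - 1)/(2*_N)
    _pctile price, percentiles(`p')

    * assumed behaviour: the result is the i-th order statistic, no averaging
    assert r(r1) == price[`i']
}
display "each midpoint percentile equals the corresponding sorted value"
```

With N = 20 the requested percentiles are 2.5, 7.5, ..., 97.5, so N*p/100 is never a whole number and no averaging should be triggered.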


Daniel Feenberg

Join Date: Oct 2014

Posts: 323
#9

09 Mar 2021, 05:41

On reflection, there is a reasonable intuition behind this choice if Stata treats the data as a sample from a continuous distribution and takes itself to be asked for the quantiles of that underlying distribution. Of course, Stata is using only two points to estimate the local slope of the distribution function. If it used more points, it could replace the one-half with something more appropriate. What surprises me now is that so many packages have settled on the same approximation.
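For comparison, -_pctile- also accepts an altdef option, which as I understand the manual replaces the averaging rule with an interpolation formula. A minimal sketch that prints both sets of results for the sample used in this thread (I have not listed the altdef values here, so treat the second line of output as something to inspect rather than a known answer):

```stata
sysuse auto, clear
keep price
keep in 1/20

* default: averages adjacent values where the empirical CDF is flat
_pctile price, percentiles(25 50 75)
display "default: " r(r1) "  " r(r2) "  " r(r3)

* altdef: the interpolation-based alternative formula
_pctile price, percentiles(25 50 75) altdef
display "altdef:  " r(r1) "  " r(r2) "  " r(r3)
```

Neither rule reproduces eq.(1) in general. Both read the sample as information about an underlying continuous distribution, which is the intuition described above.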
