Using pctile (or xtile) to create a variable that reports which percentile a given observation falls into

Landon Webber

Join Date: Jul 2022

Posts: 5
#1

Using pctile (or xtile) to create a variable that reports which percentile a given observation falls into

10 Jul 2022, 14:10

I am trying to create a variable that reports the percentile that a given observation falls into within a data set. I have been trying to use the pctile and xtile codes, but cannot produce exactly what I am looking for because I am relatively new to this.

Goal: To produce a variable that will either (a) give the exact percentile of an observation within a data set or (b) (better yet) report the quantile an observation falls into.
Additional Caveats: 1) I'd like to be able to restrict the percentile calculation only to a certain subset/category of observations in a data set, without dropping observations. 2) If I'm able to create a variable that identifies the various quantiles in which observations fall into, I'd ideally like to do this for a predetermined set of particular quantiles I'm interested in.

Example code:

sysuse auto
//(1978 automobile data); there are 74 observations

pctile percentile = mpg, nq(4)
//Variable "percentile" produced with label "percentiles of mpg"

//Note: However, when I summarize the percentile variable, all I get is this:

summarize percentile

Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
percentile | 3 21 3.605551 18 25

tab mpg percentile if percentile > 0

Mileage | percentiles of mpg
(mpg) | 18 20 25 | Total
-----------+---------------------------------+----------
17 | 0 1 0 | 1
22 | 1 0 1 | 2
-----------+---------------------------------+----------
Total | 1 1 1 | 3

tab percentile

percentiles |
of mpg | Freq. Percent Cum.
------------+-----------------------------------
18 | 1 33.33 33.33
20 | 1 33.33 66.67
25 | 1 33.33 100.00
------------+-----------------------------------
Total | 3 100.00

//First off, it only identifies three quantiles - 18, 20, and 25 - and only places the first three observations (one of which has a mileage of 17 mpg and two of which have a mileage of 22 mph) as falling into these three quantiles. I try the option genp:

drop percentile
pctile percentile = mpg, nq(4) genp(percent)

//Still this only creates percentages for the same three variables that were categorized into quantiles with the commands above. The results are below:

tab mpg percent if percent > 0

Mileage | percentages for __000002
(mpg) | 25 50 75 | Total
-----------+---------------------------------+----------
17 | 0 1 0 | 1
22 | 1 0 1 | 2
-----------+---------------------------------+----------
Total | 1 1 1 | 3

What I cannot figure out how to do is create a command that will place all 74 observations into 9 distinct quantiles - 1st percentile, 5th percentile, 10th percentile, 25th percentile, 50th percentile, 75th percentile, 95th percentile, and 99th percentile. I would like for this variable to populate with numeric data. For example, an observation that falls into the 25th percentile would have a value of "25" in the column corresponding to this variable. Even if I can get code that simply gives the percentile for an observation, I can easily write code that groups the percentiles into quantiles. I've tried altering the nq option, but it appears that this option only limits the number of observations placed into a percentile to "q-1" and that nq does not represent the number of quintiles you want to have.

To complicate things, I need to calculate percentiles for mpg within categories or subsets of the data, in addition to percentiles for each observation overall. The example I have (but have been unsuccessful in trying to execute) is the foreign variable in this data set. I would like to calculate an overall percentile for mpg for all vehicles (in the data set), while also calculating the percentiles of a given foreign car among only other foreign cars and of a given domestic car among only other domestic cars. I have tried using the by var: and if foreign == 0, but I am not getting what I need. Ideally, for the foreign only percentile, I would want an observation to show "25" in the cell if it falls into the 25th percentile for foreign only cars. I am aware (and somewhat expect) that the same observation might fall into multiple quantiles depending on the sample being measured (overall, foreign, or domestic).

Any help would be greatly appreciated!
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30191
#2

10 Jul 2022, 14:21

Code:

sysuse auto, clear local dq 1 5 10 25 50 75 95 99 levelsof rep78, local(groups) gen wanted = . foreach g of local groups { summ mpg if rep78 == `g', detail return list foreach d of local dq { replace wanted = `d' if rep78 == `g' & mpg <= `r(p`d')' & missing(wanted) } }

I did the grouping by rep78 instead of by foreign.

Note: This code exploits the fact that the percentiles in which O.P. expressed interest are precisely the ones returned by -summ, detail-, thus avoiding more complicated code working with the results of -pctile-.
1 like
Comment
Landon Webber

Join Date: Jul 2022

Posts: 5
#3

10 Jul 2022, 14:45

Thank you! This is incredibly helpful.

I should have put my next-level question in as well: If I wanted to do the grouping by a subset created using two variables instead of one, would this code still work? For example if I wanted to do a percentile calculation for every possible combination of foreign and rep78? Alternatively, how could it work if I wanted to do it only for those observations where the car was foreign and rep78 == 3 (just for the sake of example)?
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30191

10 Jul 2022, 17:13

For the combination of foreign and rep78:

Code:

sysuse auto, clear

local dq 1 5 10 25 50 75 95 99

levelsof rep78, local(groups)

gen wanted = .
foreach g of local groups {
    levelsof foreign if rep78 == `g', local(groups2)
    foreach g2 of local groups2 {
        summ mpg if rep78 == `g' & foreign == `g2', detail
        return list
        foreach d of local dq {
            replace wanted = `d' if rep78 == `g' & foreign == `g2' ///
                & mpg <= `r(p`d')' & missing(wanted)
        }
    }
}

This is as complicated as this because there are combinations of values of foreign and rep78 that never co-occur, so the code has to deal with that.

If you just want to do it for rep78 == 3, then it's much simpler code:

Code:

sysuse auto, clear

local dq 1 5 10 25 50 75 95 99


gen wanted = .
summ mpg if rep78 == 3, detail
return list
foreach d of local dq {
    replace wanted = `d' if rep78 == 3 & mpg <= `r(p`d')' & missing(wanted)
}

Comment

Landon Webber

Join Date: Jul 2022

Posts: 5
#5

11 Jul 2022, 11:48

Thank you again for the second response! Super helpful.
Comment

Announcement

Using pctile (or xtile) to create a variable that reports which percentile a given observation falls into

Comment

Comment

Comment

Comment