I am trying to create a variable that reports the percentile that a given observation falls into within a data set. I have been trying to use the pctile and xtile codes, but cannot produce exactly what I am looking for because I am relatively new to this.
Goal: To produce a variable that will either (a) give the exact percentile of an observation within a data set or (b) (better yet) report the quantile an observation falls into.
Additional Caveats: 1) I'd like to be able to restrict the percentile calculation only to a certain subset/category of observations in a data set, without dropping observations. 2) If I'm able to create a variable that identifies the various quantiles in which observations fall into, I'd ideally like to do this for a predetermined set of particular quantiles I'm interested in.
Example code:
sysuse auto
//(1978 automobile data); there are 74 observations
pctile percentile = mpg, nq(4)
//Variable "percentile" produced with label "percentiles of mpg"
//Note: However, when I summarize the percentile variable, all I get is this:
summarize percentile
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
percentile | 3 21 3.605551 18 25
tab mpg percentile if percentile > 0
Mileage | percentiles of mpg
(mpg) | 18 20 25 | Total
-----------+---------------------------------+----------
17 | 0 1 0 | 1
22 | 1 0 1 | 2
-----------+---------------------------------+----------
Total | 1 1 1 | 3
tab percentile
percentiles |
of mpg | Freq. Percent Cum.
------------+-----------------------------------
18 | 1 33.33 33.33
20 | 1 33.33 66.67
25 | 1 33.33 100.00
------------+-----------------------------------
Total | 3 100.00
//First off, it only identifies three quantiles - 18, 20, and 25 - and only places the first three observations (one of which has a mileage of 17 mpg and two of which have a mileage of 22 mph) as falling into these three quantiles. I try the option genp:
drop percentile
pctile percentile = mpg, nq(4) genp(percent)
//Still this only creates percentages for the same three variables that were categorized into quantiles with the commands above. The results are below:
tab mpg percent if percent > 0
Mileage | percentages for __000002
(mpg) | 25 50 75 | Total
-----------+---------------------------------+----------
17 | 0 1 0 | 1
22 | 1 0 1 | 2
-----------+---------------------------------+----------
Total | 1 1 1 | 3
What I cannot figure out how to do is create a command that will place all 74 observations into 9 distinct quantiles - 1st percentile, 5th percentile, 10th percentile, 25th percentile, 50th percentile, 75th percentile, 95th percentile, and 99th percentile. I would like for this variable to populate with numeric data. For example, an observation that falls into the 25th percentile would have a value of "25" in the column corresponding to this variable. Even if I can get code that simply gives the percentile for an observation, I can easily write code that groups the percentiles into quantiles. I've tried altering the nq option, but it appears that this option only limits the number of observations placed into a percentile to "q-1" and that nq does not represent the number of quintiles you want to have.
To complicate things, I need to calculate percentiles for mpg within categories or subsets of the data, in addition to percentiles for each observation overall. The example I have (but have been unsuccessful in trying to execute) is the foreign variable in this data set. I would like to calculate an overall percentile for mpg for all vehicles (in the data set), while also calculating the percentiles of a given foreign car among only other foreign cars and of a given domestic car among only other domestic cars. I have tried using the by var: and if foreign == 0, but I am not getting what I need. Ideally, for the foreign only percentile, I would want an observation to show "25" in the cell if it falls into the 25th percentile for foreign only cars. I am aware (and somewhat expect) that the same observation might fall into multiple quantiles depending on the sample being measured (overall, foreign, or domestic).
Any help would be greatly appreciated!
Goal: To produce a variable that will either (a) give the exact percentile of an observation within a data set or (b) (better yet) report the quantile an observation falls into.
Additional Caveats: 1) I'd like to be able to restrict the percentile calculation only to a certain subset/category of observations in a data set, without dropping observations. 2) If I'm able to create a variable that identifies the various quantiles in which observations fall into, I'd ideally like to do this for a predetermined set of particular quantiles I'm interested in.
Example code:
sysuse auto
//(1978 automobile data); there are 74 observations
pctile percentile = mpg, nq(4)
//Variable "percentile" produced with label "percentiles of mpg"
//Note: However, when I summarize the percentile variable, all I get is this:
summarize percentile
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
percentile | 3 21 3.605551 18 25
tab mpg percentile if percentile > 0
Mileage | percentiles of mpg
(mpg) | 18 20 25 | Total
-----------+---------------------------------+----------
17 | 0 1 0 | 1
22 | 1 0 1 | 2
-----------+---------------------------------+----------
Total | 1 1 1 | 3
tab percentile
percentiles |
of mpg | Freq. Percent Cum.
------------+-----------------------------------
18 | 1 33.33 33.33
20 | 1 33.33 66.67
25 | 1 33.33 100.00
------------+-----------------------------------
Total | 3 100.00
//First off, it only identifies three quantiles - 18, 20, and 25 - and only places the first three observations (one of which has a mileage of 17 mpg and two of which have a mileage of 22 mph) as falling into these three quantiles. I try the option genp:
drop percentile
pctile percentile = mpg, nq(4) genp(percent)
//Still this only creates percentages for the same three variables that were categorized into quantiles with the commands above. The results are below:
tab mpg percent if percent > 0
Mileage | percentages for __000002
(mpg) | 25 50 75 | Total
-----------+---------------------------------+----------
17 | 0 1 0 | 1
22 | 1 0 1 | 2
-----------+---------------------------------+----------
Total | 1 1 1 | 3
What I cannot figure out how to do is create a command that will place all 74 observations into 9 distinct quantiles - 1st percentile, 5th percentile, 10th percentile, 25th percentile, 50th percentile, 75th percentile, 95th percentile, and 99th percentile. I would like for this variable to populate with numeric data. For example, an observation that falls into the 25th percentile would have a value of "25" in the column corresponding to this variable. Even if I can get code that simply gives the percentile for an observation, I can easily write code that groups the percentiles into quantiles. I've tried altering the nq option, but it appears that this option only limits the number of observations placed into a percentile to "q-1" and that nq does not represent the number of quintiles you want to have.
To complicate things, I need to calculate percentiles for mpg within categories or subsets of the data, in addition to percentiles for each observation overall. The example I have (but have been unsuccessful in trying to execute) is the foreign variable in this data set. I would like to calculate an overall percentile for mpg for all vehicles (in the data set), while also calculating the percentiles of a given foreign car among only other foreign cars and of a given domestic car among only other domestic cars. I have tried using the by var: and if foreign == 0, but I am not getting what I need. Ideally, for the foreign only percentile, I would want an observation to show "25" in the cell if it falls into the 25th percentile for foreign only cars. I am aware (and somewhat expect) that the same observation might fall into multiple quantiles depending on the sample being measured (overall, foreign, or domestic).
Any help would be greatly appreciated!
Comment