Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Using pctile (or xtile) to create a variable that reports which percentile a given observation falls into

    I am trying to create a variable that reports the percentile that a given observation falls into within a data set. I have been trying to use the pctile and xtile codes, but cannot produce exactly what I am looking for because I am relatively new to this.

    Goal: To produce a variable that will either (a) give the exact percentile of an observation within a data set or (b) (better yet) report the quantile an observation falls into.
    Additional Caveats: 1) I'd like to be able to restrict the percentile calculation only to a certain subset/category of observations in a data set, without dropping observations. 2) If I'm able to create a variable that identifies the various quantiles in which observations fall into, I'd ideally like to do this for a predetermined set of particular quantiles I'm interested in.

    Example code:

    sysuse auto
    //(1978 automobile data); there are 74 observations

    pctile percentile = mpg, nq(4)
    //Variable "percentile" produced with label "percentiles of mpg"

    //Note: However, when I summarize the percentile variable, all I get is this:

    summarize percentile

    Variable | Obs Mean Std. dev. Min Max
    -------------+---------------------------------------------------------
    percentile | 3 21 3.605551 18 25


    tab mpg percentile if percentile > 0

    Mileage | percentiles of mpg
    (mpg) | 18 20 25 | Total
    -----------+---------------------------------+----------
    17 | 0 1 0 | 1
    22 | 1 0 1 | 2
    -----------+---------------------------------+----------
    Total | 1 1 1 | 3


    tab percentile

    percentiles |
    of mpg | Freq. Percent Cum.
    ------------+-----------------------------------
    18 | 1 33.33 33.33
    20 | 1 33.33 66.67
    25 | 1 33.33 100.00
    ------------+-----------------------------------
    Total | 3 100.00


    //First off, it only identifies three quantiles - 18, 20, and 25 - and only places the first three observations (one of which has a mileage of 17 mpg and two of which have a mileage of 22 mph) as falling into these three quantiles. I try the option genp:

    drop percentile
    pctile percentile = mpg, nq(4) genp(percent)


    //Still this only creates percentages for the same three variables that were categorized into quantiles with the commands above. The results are below:

    tab mpg percent if percent > 0

    Mileage | percentages for __000002
    (mpg) | 25 50 75 | Total
    -----------+---------------------------------+----------
    17 | 0 1 0 | 1
    22 | 1 0 1 | 2
    -----------+---------------------------------+----------
    Total | 1 1 1 | 3


    What I cannot figure out how to do is create a command that will place all 74 observations into 9 distinct quantiles - 1st percentile, 5th percentile, 10th percentile, 25th percentile, 50th percentile, 75th percentile, 95th percentile, and 99th percentile. I would like for this variable to populate with numeric data. For example, an observation that falls into the 25th percentile would have a value of "25" in the column corresponding to this variable. Even if I can get code that simply gives the percentile for an observation, I can easily write code that groups the percentiles into quantiles. I've tried altering the nq option, but it appears that this option only limits the number of observations placed into a percentile to "q-1" and that nq does not represent the number of quintiles you want to have.

    To complicate things, I need to calculate percentiles for mpg within categories or subsets of the data, in addition to percentiles for each observation overall. The example I have (but have been unsuccessful in trying to execute) is the foreign variable in this data set. I would like to calculate an overall percentile for mpg for all vehicles (in the data set), while also calculating the percentiles of a given foreign car among only other foreign cars and of a given domestic car among only other domestic cars. I have tried using the by var: and if foreign == 0, but I am not getting what I need. Ideally, for the foreign only percentile, I would want an observation to show "25" in the cell if it falls into the 25th percentile for foreign only cars. I am aware (and somewhat expect) that the same observation might fall into multiple quantiles depending on the sample being measured (overall, foreign, or domestic).

    Any help would be greatly appreciated!









  • #2
    Code:
    sysuse auto, clear
    
    local dq 1 5 10 25 50 75 95 99
    
    levelsof rep78, local(groups)
    
    gen wanted = .
    foreach g of local groups {
        summ mpg if rep78 == `g', detail
        return list
        foreach d of local dq {
            replace wanted = `d' if rep78 == `g' & mpg <= `r(p`d')' & missing(wanted)
        }
    }
    I did the grouping by rep78 instead of by foreign.

    Note: This code exploits the fact that the percentiles in which O.P. expressed interest are precisely the ones returned by -summ, detail-, thus avoiding more complicated code working with the results of -pctile-.

    Comment


    • #3
      Thank you! This is incredibly helpful.

      I should have put my next-level question in as well: If I wanted to do the grouping by a subset created using two variables instead of one, would this code still work? For example if I wanted to do a percentile calculation for every possible combination of foreign and rep78? Alternatively, how could it work if I wanted to do it only for those observations where the car was foreign and rep78 == 3 (just for the sake of example)?

      Comment


      • #4
        For the combination of foreign and rep78:

        Code:
        sysuse auto, clear
        
        local dq 1 5 10 25 50 75 95 99
        
        levelsof rep78, local(groups)
        
        gen wanted = .
        foreach g of local groups {
            levelsof foreign if rep78 == `g', local(groups2)
            foreach g2 of local groups2 {
                summ mpg if rep78 == `g' & foreign == `g2', detail
                return list
                foreach d of local dq {
                    replace wanted = `d' if rep78 == `g' & foreign == `g2' ///
                        & mpg <= `r(p`d')' & missing(wanted)
                }
            }
        }
        This is as complicated as this because there are combinations of values of foreign and rep78 that never co-occur, so the code has to deal with that.

        If you just want to do it for rep78 == 3, then it's much simpler code:
        Code:
        sysuse auto, clear
        
        local dq 1 5 10 25 50 75 95 99
        
        
        gen wanted = .
        summ mpg if rep78 == 3, detail
        return list
        foreach d of local dq {
            replace wanted = `d' if rep78 == 3 & mpg <= `r(p`d')' & missing(wanted)
        }

        Comment


        • #5
          Thank you again for the second response! Super helpful.

          Comment

          Working...
          X