Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How do I compute the percentile corresponding to a given value?

    Dear all,

    I want to find the percentile of BMI that corresponds to a BMI of 25 kg/m2, so I can use this percentile as cut-off for the BMI z-score. I can do it “manually” using the centile command several times by ‘guessing’ until I type the centile that corresponds to exactly 25 kg/m2. My question is, can this be done in STATA in an automated way? So I don’t have to redo it after for instance excluding some individuals.

    Many thanks in advance

  • #2
    Have a look at help cumul. With cumul you can calculate the cumulative distribution for BMI, and then look at the value of this for BMI = 25. Given the estimate of F(x) = p for p between 0 and 1, you can find p_bar(x) = F^-1(25)

    Code something like the following

    Code:
    cumul BMI, gen(cumBMI) equal  // use -equal- option if appropriate in your case
    
    list cumBMI if BMI = 25
    
    summarize cumBMI if BMI = 25
    
    ge lowBMI = BMI < r(mean)   // plus code to deal with missing values if relevant

    Comment


    • #3
      Something like

      Code:
      egen pctile = mean((bmi < 25) / (bmi < .))
      should get you started.

      You don't give example data, so let's jump to an analogue. Pretend we want to know the percentile of 25 mpg in the auto dataset.

      Code:
      . sysuse auto, clear
      (1978 Automobile Data)
      
      . egen pctile = mean((mpg < 25) / (mpg < .))
      
      . su pctile
      
          Variable |        Obs        Mean    Std. Dev.       Min        Max
      -------------+---------------------------------------------------------
            pctile |         74    .7432432           0   .7432432   .7432432
      Creating a variable to hold a constant is usually overkill, but there is method to the madness.

      But first, let's explain. (mpg < 25) is evaluated as true (1) or false (0), so its mean is the proportion less than 25. More on that at http://www.stata.com/support/faqs/da...rue-and-false/

      So why are we dividing by (mpg < .)? That is true (1 again) if mpg is not missing and false (0 again) when it is missing.

      Whenever we divide by 0 we will get missings, but that's good when it happens: the mean() function just ignores the missings, which is what we want. More on that at http://www.stata-journal.com/sjpdf.h...iclenum=dm0055 Section 10.

      (If we knew that all the missings were missing because the people were too heavy to measure, that would be the wrong thing to do....)

      What's nice about this method is that it is easily extended.

      OK neat, but I really want this done by groups.

      Code:
      . egen pctile_2 = mean((mpg < 25) / (mpg < .)), by(foreign)
      
      . tabdisp foreign, c(pctile_2)
      
      ----------------------
       Car type |   pctile_2
      ----------+-----------
       Domestic |   .8461539
        Foreign |         .5
      ----------------------
      
      . tabdisp foreign, c(pctile_2) format(%4.3f)
      
      ----------------------
       Car type |   pctile_2
      ----------+-----------
       Domestic |      0.846
        Foreign |      0.500
      ----------------------
      OK neat, but I really want this as a percent

      Code:
      . egen pctile_3 = mean(100 * (mpg < 25) / (mpg < .)), by(foreign)
      
      . tabdisp foreign, c(pctile_3) format(%3.1f)
      
      ----------------------
      Car type | pctile_3
      ----------+-----------
      Domestic | 84.6
      Foreign | 50.0
      ----------------------
      OK neat, but I really want a smarter definition to cope with ties. I will split values equal to 25 and count just half of them, Tukey-style (and Galton-style).

      Code:
      . egen pctile_4 = mean(100 * ((mpg < 25) + 0.5 * (mpg == 25)) / (mpg < .)), by(foreign)
      
      . tabdisp foreign, c(pctile_4) format(%3.1f)
      
      ----------------------
       Car type |   pctile_4
      ----------+-----------
       Domestic |       85.6
        Foreign |       59.1
      ----------------------

      OK neat, but I really want...

      ....

      By the way, Frank Harrell has a sharp comment somewhere on it being immensely more sensible clinically and epidemiology to use BMI as a predictor, not its percentile.

      For more on percentiles, see e.g. http://www.stata.com/support/faqs/st...ing-positions/
      Last edited by Nick Cox; 01 Dec 2016, 02:33.

      Comment


      • #4
        Stephen's suggestion contains minor typos.

        Code:
        if BMI = 25
        should be

        Code:
        if BMI == 25

        Comment


        • #5
          Woops. After soooo many years using Stata, my error is really embarrassing!

          Comment


          • #6
            Stephen: Cheer up, you're still young. Someone found a bug in one of my programs only yesterday.

            Comment


            • #7
              Thanks for the fast responces. Very valuable!
              The line:
              ge lowBMI = BMI < r(mean) does not work for me. I get a new variabel = 0 for all individuals. What was the purpose of this line?
              Can I use the r(mean) in combination with the centile command?, for instance like this:

              "centile zc_bmi, centile (r(mean))"

              It does not work. It says it's an invalid numlist.
              Thanks again.

              Comment


              • #8
                Lisa in #6 wrote
                ge lowBMI = BMI < r(mean) does not work for me. I get a new variabel = 0 for all individuals. What was the purpose of this line?
                Purpose of line: to generate a new binary indicator equal to 1 if observation has a BMI below your critical percentile cut-off; 0 otherwise (with missing BMI values treated as zeros), applying the logic regarding inverting the empirical CDF set out above. The summarize line is intended to derive the value of the percentile at BMI = 25, noting that the value is saved in r(mean) ... which you can then refer to.

                Why do you get zero values for lowBMI for all cases? My best guess is that you have no cases with BMI equal to 25 exactly. (In which case the logic above based on summarize doesn't work.)

                This is all guess work from our side because you have not shown us exactly what you sent Stata and exactly what you got back (all of the relevant lines) and reported it to us using CODE delimiters. (See the Forum FAQ about this.) The list code line would have shown if you had observations with BMI = 25 exactly

                Here's a worked example with the auto data set to illustrate the issues

                Code:
                . sysuse auto, clear
                (1978 Automobile Data)
                
                . cumul price, gen(cumPrice) equal
                
                .
                . list price cumPrice if price == 5000
                
                . list price cumPrice if abs(price - 5000) < 50
                
                . list price cumPrice if abs(price - 5000) < 100
                
                     +------------------+
                     | price   cumPrice |
                     |------------------|
                 48. | 4,934         .5 |
                 58. | 5,079   .5135135 |
                     +------------------+
                
                .
                . summ price cumPrice
                
                    Variable |        Obs        Mean    Std. Dev.       Min        Max
                -------------+---------------------------------------------------------
                       price |         74    6165.257    2949.496       3291      15906
                    cumPrice |         74    .5067568    .2906191   .0135135          1
                
                . summ cumPrice if price == 5000
                
                    Variable |        Obs        Mean    Std. Dev.       Min        Max
                -------------+---------------------------------------------------------
                    cumPrice |          0
                
                . return list
                
                scalars:
                                  r(N) =  0
                              r(sum_w) =  0
                                r(sum) =  0
                
                .
                . summ cumPrice if abs(price - 5000) < 100
                
                    Variable |        Obs        Mean    Std. Dev.       Min        Max
                -------------+---------------------------------------------------------
                    cumPrice |          2    .5067568    .0095555         .5   .5135135
                
                . return list
                
                scalars:
                                  r(N) =  2
                              r(sum_w) =  2
                               r(mean) =  .5067567527294159
                                r(Var) =  .0000913074148929
                                 r(sd) =  .0095554913475414
                                r(min) =  .5
                                r(max) =  .5135135054588318
                                r(sum) =  1.013513505458832
                
                .
                . ge lowPrice = cumPrice < r(mean)
                
                . ta lowPrice
                
                   lowPrice |      Freq.     Percent        Cum.
                ------------+-----------------------------------
                          0 |         37       50.00       50.00
                          1 |         37       50.00      100.00
                ------------+-----------------------------------
                      Total |         74      100.00

                Comment


                • #9
                  Thanks again, also for the advice on how to post in here. I'm still an amateur.

                  Now I see why "ge lowBMI = BMI < r(mean) " did not work. It should be "ge lowBMI = cumBMI < r(mean)" (no one has a BMI <0.9).
                  Then I just need to find out how to make combine this with the centile command, so I get the corresponding centile for the BMI z-score.

                  Comment


                  • #10
                    There is still advice for the reading in #3.

                    Comment

                    Working...
                    X