  • -pctilesets- downloadable from SSC: percentile sets

    As promised in https://www.statalist.org/forums/for...-interval-sets a new command pctilesets is now downloadable from SSC.

    Stata 8.1 is required.

    Here is the executive summary (edited very slightly) from

    Code:
    ssc desc pctilesets
    to enable you to decide quickly whether you should care.

    pctilesets computes percentile or quantile sets. It is a
    wrapper for summarize. Typically the user will specify one or
    more of the options minimum, maximum or pctile() to select
    extremes and/or any or all of the percentiles for 1, 5, 10, 25
    (lower quartile), 50 (median), 75 (upper quartile), 90, 95 and
    99% cumulative probability. There are two syntaxes, for
    variables and for groups of observations. pctilesets by default
    lists its results. Although saving to a permanent dataset is
    optional, that is the intended key to many useful applications,
    graphical and otherwise.
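
    Since pctilesets is described as a wrapper for summarize, it may help to see the raw ingredients first. The following is a minimal sketch using only official Stata, not the code of pctilesets itself: it shows the stored results that summarize with the detail option leaves behind for a single variable.

```stata
* Sketch with official Stata only; pctilesets itself is not needed here
sysuse auto, clear
quietly summarize mpg, detail

* r(min), r(p1), r(p5), ..., r(p99), r(max) and r(N) are all stored results
return list
display "n = " r(N) "  min = " r(min) "  median = " r(p50) "  max = " r(max)
```

    Collecting such results by hand for several variables or groups is the bookkeeping that a dataset of results is meant to avoid.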


    A percentile set is a temporary dataset containing some, or occasionally all, of the following variables.

    * varname is a string variable holding the name or names of the variable(s) being summarized.

    * varlabel is a string variable holding the variable label of each variable being summarized. If no variable label has been defined, the value is instead the variable name.

    * (Groups syntax only) origgvar is a numeric or string variable as specified in the over() option.

    * (Groups syntax only) groupvar is a string variable holding the name of the group variable specified in the over() option.

    * (Groups syntax only) gvarlabel is a string variable holding the variable label of the group variable groupvar specified in the over() option. If no variable label has been defined, the value is instead the variable name.

    * (Groups syntax only) group is a numeric variable with value labels describing each distinct value of groupvar. Each such variable has integer values from 1 upward and value labels derived from the variable specified.

    * n is a numeric variable holding the number of observations used in the estimate.

    * Any or all of min, p1, p5, p10, p25, p50, p75, p90, p95, p99 or max holding results for the measure concerned.

    (Detail: As first posted, the help file contains a typo in the explanation above that will be corrected in due course.)

    * weights is a string variable appearing if (and only if) weights were specified as a record of such use.


    Examples may well serve better than any amount of exegesis. Those here stress graphical applications, which I have most in mind for my own use, but other uses may well occur to you.

    In this first example, I use qplot from the Stata Journal for quantile plots and, in the spirit if not quite the letter of Emanuel Parzen https://www.jstor.org/stable/2286734, present hybrid quantile-box plots that aim to combine the informative detail of quantile plots with the skeletal summary of box plots. The detail exploited is that each cumulative probability scale on the x axis is known to stretch from 0 to 1, so 1.1 is a safe place to put the box plots. Naturally there are other ways to do that: you could call up egen repeatedly to put the five-number summaries (extremes, quartiles, median) in new variables. With this design there is no need for arbitrary criteria on what to show beyond the quartiles. (Tukey's rule of thumb to plot points individually if and only if they lie at least 1.5 IQR from the nearer quartile seems to remain the most common convention, but there is all too much evidence that it is, variously, not explained at all by authors who use it, or poorly explained, or even incorrectly explained. It may safely be guessed that readers are often not well placed to know better.)
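
    For comparison, the egen route mentioned above can be sketched as follows. This uses only official Stata; the variable names mpg_min, mpg_p25 and so on are arbitrary choices for illustration, not names produced by pctilesets.

```stata
* Five-number summary by group, one egen call per statistic
sysuse auto, clear
egen mpg_min = min(mpg), by(foreign)
egen mpg_p25 = pctile(mpg), p(25) by(foreign)
egen mpg_p50 = median(mpg), by(foreign)
egen mpg_p75 = pctile(mpg), p(75) by(foreign)
egen mpg_max = max(mpg), by(foreign)

* Each new variable is constant within groups, so one row per group suffices
tabdisp foreign, cellvar(mpg_min mpg_p25 mpg_p50 mpg_p75 mpg_max)
```

    The results sit alongside the data directly, so no merge is needed, at the cost of one call per statistic.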


    Code:
        . sysuse auto, clear
    
        . pctilesets mpg, p(25 50 75) min max over(foreign) saving(foo, replace)
    
        . clonevar origgvar=foreign
    
        . merge m:1 origgvar using foo
    
        . gen where = 1.1
    
        . #delimit ;
        . qplot mpg, ms(O) by(foreign, note("") legend(off))
        addplot(scatter p50 where, ms(Dh) msize(medlarge) pstyle(p2)
        xla(0 1 0.25 "0.25" 0.5 "0.5" 0.75 "0.75") xtitle(Fraction of data)
        || rbar p75 p25 where, fcolor(none) barw(0.12) pstyle(p2)
        || rspike p25 min where, pstyle(p2)
        || rspike p75 max where, pstyle(p2)) name(QB1, replace);
        . #delimit cr

    [Figure: QB1.png]




    As a twist on that, let's use the normal distribution as reference. That doesn't imply an attitude that the data are, or should be, normally distributed, any more than the default quantile plot implies that the data are, or should be, uniformly distributed.

    In practice you may need two passes, or a small calculation on the side, to work out a good location for the box plots.
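
    One possible form of that side calculation is sketched here. It assumes that qplot is using plotting positions (i - 0.5)/n, which I believe is its default but which should be checked against its help; maxdev is a hypothetical variable name. The largest plotted normal deviate in each group is then invnormal((n - 0.5)/n), and the box plots should sit a little beyond the larger of those.

```stata
* Largest standard normal deviate shown in each group,
* assuming plotting positions (i - 0.5)/n
sysuse auto, clear
bysort foreign: gen double maxdev = invnormal((_N - 0.5) / _N)
tabdisp foreign, cellvar(maxdev)
```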

    Code:
        . replace where = 2.7
    
        . #delimit ;
        . qplot mpg, ms(O) by(foreign, note("") legend(off))
        xla(-2/2) xtitle(Standard normal deviate)
        addplot(scatter p50 where, ms(Dh) msize(medlarge) pstyle(p2)
        || rbar p75 p25 where, fcolor(none) barw(0.44) pstyle(p2)
        || rspike p25 min where, pstyle(p2)
        || rspike p75 max where, pstyle(p2)) trscale(invnormal(@))
        name(QB2, replace);
        . #delimit cr

    [Figure: QB2.png]




    As yet another twist, we revert to Parzen's original idea that the quantile plot and the box plot can share exactly the same space: not only are median and quartiles levels that can be shown on the quantile axis, so also 0.25 0.5 0.75 are levels that can be shown on the probability axis.

    I like the mantra for explaining the idea that

    if half the data points -- those between the quartiles -- lie inside each box, then as part of the same idea half the data points must lie outside each box -- and often they are the more interesting or important half!

    However, once that point is seen, I tend to move back to showing quantile and box plots side by side.
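
    A sketch of one way to draw a box in the same space as the quantile plot follows. It is not necessarily the code behind the figure: the variable half and the graph name QB3sketch are hypothetical, and the pctilesets and qplot calls simply follow the syntax of the earlier examples. A range bar centred on 0.5 with width 0.5 spans 0.25 to 0.75 on the probability axis, while its ends sit at the quartiles on the quantile axis.

```stata
* Run from a do-file: #delimit works only there
* Requires pctilesets (SSC) and qplot (Stata Journal) to be installed
sysuse auto, clear
pctilesets mpg, p(25 50 75) min max over(foreign) saving(foo, replace)
clonevar origgvar = foreign
merge m:1 origgvar using foo
gen half = 0.5

#delimit ;
qplot mpg, ms(O) by(foreign, note("") legend(off))
addplot(rbar p25 p75 half, fcolor(none) barw(0.5) pstyle(p2)
|| rbar p50 p50 half, barw(0.5) pstyle(p2))
xla(0 0.25 "0.25" 0.5 "0.5" 0.75 "0.75" 1) xtitle(Fraction of data)
name(QB3sketch, replace);
#delimit cr
```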

    [Figure: qb3.png]



    This last example uses a different dataset, from Mardia et al. (1979, 2024). The appearance of a second edition 45 years after the first should inspire technical authors everywhere.

    Here, as often, we do need a different layout for best presentation, and we should want to avoid the dopeyness of alphabetical order (algebra first!). myaxis from the Stata Journal helps there.


    Code:
        . use https://www.stata-journal.com/software/sj20-2/pr0046_1/mathsmarks, clear
        . rename * (marks*)
        . gen id = _n
        . reshape long marks, i(id) j(subject) string
        . myaxis subject2=subject, sort(median marks)
    
        . qplot marks, by(subject2, row(1) compact) ytitle(Mathematics marks)
    
        . pctilesets marks, over(subject2) min max p(25 50 75) saving(foo, replace)
        . gen where = 1.1
        . clonevar origgvar=subject2
        . merge m:1 origgvar using foo
    
        . #delimit ;
        . qplot marks, by(subject2, row(1) compact legend(off) note(""))
        addplot(rspike p25 min where, pstyle(p2)
        || rspike p75 max where, pstyle(p2)
        || rbar p25 p75 where, barw(0.16) fcolor(none) pstyle(p2)
        || scatter p50 where, ms(Dh) msize(medlarge) pstyle(p2))
        ytitle(Mathematics marks)
        xla(0 0.25 "0.25" 0.5 "0.5" 0.75 "0.75" 1) xtitle(Fraction of data) name(QB4, replace);
        . #delimit cr

    [Figure: QB4.png]




    Mardia, K. V., J. T. Kent and J. M. Bibby. 1979. Multivariate Analysis. London: Academic Press.

    Mardia, K. V., J. T. Kent and C. C. Taylor. 2024. Multivariate Analysis. Hoboken, NJ: John Wiley.

    Note that pctilesets allows many other choices. For example, plotting whiskers stretching to say 5% and 95% points, or 10% and 90% points, is to my mind a simpler alternative to the Tukey convention (and an older idea, stretching back to Bowley (and Galton before him)).
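
    As a sketch of that alternative, the first example can be adapted so that whiskers stop at the 5% and 95% points. This assumes only the syntax already shown above, with 5 and 95 added to p(); the graph name QB5sketch is a hypothetical choice.

```stata
* Run from a do-file: #delimit works only there
* Requires pctilesets (SSC) and qplot (Stata Journal) to be installed
sysuse auto, clear
pctilesets mpg, p(5 25 50 75 95) over(foreign) saving(foo, replace)
clonevar origgvar = foreign
merge m:1 origgvar using foo
gen where = 1.1

#delimit ;
qplot mpg, ms(O) by(foreign, note("") legend(off))
addplot(scatter p50 where, ms(Dh) msize(medlarge) pstyle(p2)
|| rbar p75 p25 where, fcolor(none) barw(0.12) pstyle(p2)
|| rspike p25 p5 where, pstyle(p2)
|| rspike p75 p95 where, pstyle(p2))
xla(0 0.25 "0.25" 0.5 "0.5" 0.75 "0.75" 1) xtitle(Fraction of data)
name(QB5sketch, replace);
#delimit cr
```

    Points beyond the whiskers need no separate treatment here, as the quantile plot already shows every value.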

    Anyone interested -- except those behind fearsome firewalls -- can naturally download the code and look at the help file, noting that an ancillary file contains all the code for the examples.

    A sequel command for sets of moment-based measures is already done and will be posted shortly.

    The small theme of exploiting addplot() options was also flagged recently in https://www.statalist.org/forums/for...addplot-option

  • #2
    momentsets is now available from SSC as promised in #1, thanks as ever to Kit Baum. Stata 8.1 is needed. (Here and above, the sense is that I am not aware that either command needs a later version, but I haven't tested the code on Stata 8.1.)

    momentsets computes moment-based measures and collects them into
    datasets. It is a wrapper for summarize. Typically the user
    will specify one or more of the options mean, sd, var, skewness
    or kurtosis. There are two syntaxes, for variables and for
    groups of observations. momentsets by default lists its results.
    Although saving to a permanent dataset is optional, that is the
    intended key to many useful applications, graphical and
    otherwise.


    I will not announce it separately. The point is simple and should make sense if you have studied #1. This little project is not complete yet: the mystery D'Artagnan to join the Three Musketeers cisets, pctilesets and momentsets is firmly in mind, but not yet in sight.

    Backing up, I strongly wanted what became cisets; then it became clear that a parallel command for selected quantiles would be something I would use much; finally a similar command for moment-based measures seemed like a complement.

    Uses might be preparing the ground for a plot of SD or variance against mean, or of kurtosis against skewness.
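
    The kind of dataset and plot in mind can be sketched with official commands alone. This is an illustration using statsby and summarize, not the syntax or output of momentsets; the new variable names are arbitrary.

```stata
* One observation per group, holding moment-based measures from summarize
sysuse auto, clear
drop if missing(rep78)
statsby mean=r(mean) sd=r(sd) skewness=r(skewness) kurtosis=r(kurtosis), ///
    by(rep78) clear: summarize mpg, detail

list, noobs
scatter sd mean, mlabel(rep78) ytitle(SD of mpg) xtitle(Mean of mpg)
```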



    • #3
      As a non-native English speaker, I am often confused about percentile, quantile, quartile and quintile, and also about Stata commands such as pctile, xtile and centile and user-written commands such as quantiles and sumdist. I will come back to Nick's post here when I encounter questions in this regard.
      https://www.statalist.org/forums/for...nd-percentiles
      https://stats.stackexchange.com/questions/156778/percentile-vs-quantile-vs-quartile
      https://statisticalpoint.com/percent...e-vs-quantile/



      • #4
        Yet another reference https://stats.stackexchange.com/ques...half-a-percent

        The thread title isn't promising and my answer starts by trying to respect the question, but later edits bring it closer to #3.

        Quartiles. There are 3 of them and they divide the data into 4 groups, ideally of equal frequency.

        Quintiles. There are 4 of them and they divide the data into 5 groups, ideally of equal frequency.

        Percentiles. There are 99 of them and they divide the data into 100 groups, ideally of equal frequency. However, I doubt many people would object to e.g. the 97.5% point as also being a percentile. This definition doesn't commit you to calculating or using them all.

        Quantile. The general term for any measure defined by some fraction being smaller and the complementary fraction being larger.

        These are just the words. The recipes for calculation vary, depending on

        1. sample size being a multiple of the number of quantiles, or not

        2. what to do about ties and what to do about less than/equal to/more than

        3. what to do about weights

        and yet other details.
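
        A small sketch of how recipes can differ, using only official commands on ten made-up values 1 to 10. The scalar names are arbitrary. The default rule of _pctile, its altdef option and centile are compared for the lower quartile.

```stata
* Ten values, 1 to 10: the sample size is not a multiple of 4
clear
set obs 10
gen x = _n

_pctile x, p(25)
scalar q_default = r(r1)

_pctile x, p(25) altdef
scalar q_altdef = r(r1)

centile x, centile(25)
scalar q_centile = r(c_1)

display "default " q_default "  altdef " q_altdef "  centile " q_centile
```

        On these values the default rule returns 3, an observed value, while the other two interpolate and return 2.75.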
