-pctilesets- downloadable from SSC: percentile sets

Nick Cox

#1

11 Nov 2024, 06:01

As promised in https://www.statalist.org/forums/for...-interval-sets a new command pctilesets is now downloadable from SSC.

Stata 8.1 is required.

Here is the executive summary (edited very slightly) from

Code:

ssc desc pctilesets

to enable you to decide quickly whether you should care.

pctilesets computes percentile or quantile sets. It is a
wrapper for summarize. Typically the user will specify one or
more of the options minimum, maximum or pctile() to select
extremes and/or any or all of the percentiles for 1, 5, 10, 25
(lower quartile), 50 (median), 75 (upper quartile), 90, 95 and
99% cumulative probability. There are two syntaxes, for
variables and for groups of observations. pctilesets by default
lists its results. Although saving to a permanent dataset is
optional, that is the intended key to many useful applications,
graphical and otherwise.
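Because pctilesets is a wrapper for summarize, the raw ingredients can be inspected directly with the official command. What follows is only a minimal sketch with the auto data shipped with Stata, showing the stored results that summarize leaves behind; it is not the internal code of pctilesets.

```stata
sysuse auto, clear
* detail is needed for the percentiles; N, min and max are returned either way
summarize mpg, detail
display r(N) " observations"
display "min " r(min) "  p25 " r(p25) "  p50 " r(p50) "  p75 " r(p75) "  max " r(max)
* every stored result, including p1 p5 p10 p90 p95 p99
return list
```

The nine percentiles listed in the summary are exactly those that summarize, detail returns, which is presumably why the choice is restricted to them.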

A percentile set is a temporary dataset containing some or occasionally all of the following variables.

* varname is a string variable holding the name or names of the variable(s) being summarized.

* varlabel is a string variable holding the variable label of each variable being summarized. If no variable label has been defined, the value is instead the variable name.

* (Groups syntax only) origgvar is a numeric or string variable as specified in the over() option.

* (Groups syntax only) groupvar is a string variable holding the name of the group variable specified in the over() option.

* (Groups syntax only) gvarlabel is a string variable holding the variable label of the group variable groupvar specified in the over() option. If no variable label has been defined, the value is instead the variable name.

* (Groups syntax only) group is a numeric variable with value labels describing each distinct value of groupvar. Each such variable has integer values 1 up and value labels derived from the variable specified.

* n is a numeric variable holding the number of observations used in the estimate.

* Any or all of min, p1, p5, p10, p25, p50, p75, p90, p95, p99 or max holding results for the measure concerned.

(Detail: As first posted, the help file contains a typo in the explanation above that will be corrected in due course.)

* weights is a string variable appearing if (and only if) weights were specified as a record of such use.
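To see which of these variables a particular call produces, one option is to save a percentile set and open it. This sketch reuses the pctilesets call from the first example in this post; foo is just a scratch filename, and it is assumed that saving() writes foo.dta to the current working directory.

```stata
sysuse auto, clear
pctilesets mpg, p(25 50 75) min max over(foreign) saving(foo, replace)
* open the saved percentile set and look at its structure
use foo, clear
describe
list
```

With the groups syntax the saved dataset should include origgvar, which is the key used for merging back into the original data in the examples.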

Examples may well serve better than any amount of exegesis. Those here stress graphical applications, which I have most in mind for my own use, but other uses may well occur to you.

In this first example, I use qplot from the Stata Journal for quantile plots and, in the spirit if not quite the letter of Emanuel Parzen https://www.jstor.org/stable/2286734, present hybrid quantile-box plots that aim to combine the informative detail of quantile plots with the skeletal summary of box plots. The detail exploited is that each cumulative probability scale on the x axis is known to stretch from 0 to 1, so 1.1 is a safe place to put the box plots. Naturally there are other ways to do that: you could call up egen repeatedly to put the five-number summaries (extremes, quartiles, median) in new variables.

With this design there is no need for arbitrary criteria on what to show beyond the quartiles. (Tukey's rule of thumb to plot points individually if and only if they lie at least 1.5 IQR from the nearer quartile seems to remain the most common convention, but there is all too much evidence that it is, variously, not explained at all by authors who use it, or poorly explained, or even incorrectly explained. It may safely be guessed that readers are often not well placed to know better.)

Code:

sysuse auto, clear
pctilesets mpg, p(25 50 75) min max over(foreign) saving(foo, replace)
clonevar origgvar=foreign
merge m:1 origgvar using foo
gen where = 1.1
#delimit ;
qplot mpg, ms(O) by(foreign, note("") legend(off))
addplot(scatter p50 where, ms(Dh) msize(medlarge) pstyle(p2)
xla(0 1 0.25 "0.25" 0.5 "0.5" 0.75 "0.75") xtitle(Fraction of data)
|| rbar p75 p25 where, fcolor(none) barw(0.12) pstyle(p2)
|| rspike p25 min where, pstyle(p2)
|| rspike p75 max where, pstyle(p2))
name(QB1, replace);
#delimit cr

As a twist on that, let's use the normal distribution as reference. That doesn't imply an attitude that the data are, or should be, normally distributed, any more than the default quantile plot implies that the data are, or should be, uniformly distributed.

In practice you may need two passes, or a small calculation on the side, to work out a good location for the box plots.

Code:

replace where = 2.7
#delimit ;
qplot mpg, ms(O) by(foreign, note("") legend(off))
xla(-2/2) xtitle(Standard normal deviate)
addplot(scatter p50 where, ms(Dh) msize(medlarge) pstyle(p2)
|| rbar p75 p25 where, fcolor(none) barw(0.44) pstyle(p2)
|| rspike p25 min where, pstyle(p2)
|| rspike p75 max where, pstyle(p2))
trscale(invnormal(@)) name(QB2, replace);
#delimit cr

As yet another twist, we revert to Parzen's original idea that the quantile plot and the box plot can share exactly the same space: not only are median and quartiles levels that can be shown on the quantile axis, so also 0.25 0.5 0.75 are levels that can be shown on the probability axis.

I like the mantra for explaining the idea that

if half the data points -- those between the quartiles -- lie inside each box, then as part of the same idea half the data points must lie outside each box -- and often they are the more interesting or important half!

However, once that point is seen, I tend to move back to showing quantile and box plots side by side.

This last example uses a different dataset, from Mardia et al. (1979, 2024). The appearance of a second edition 45 years after the first should inspire technical authors everywhere.

Here, as often, we do need a different layout for best presentation, and we should want to avoid the dopeyness of alphabetical order (algebra first!). myaxis from the Stata Journal helps there.

Code:

use https://www.stata-journal.com/software/sj20-2/pr0046_1/mathsmarks, clear
rename * (marks*)
gen id = _n
reshape long marks, i(id) j(subject) string
myaxis subject2=subject, sort(median marks)
qplot marks, by(subject2, row(1) compact) ytitle(Mathematics marks)
pctilesets marks, over(subject2) min max p(25 50 75) saving(foo, replace)
gen where = 1.1
clonevar origgvar=subject2
merge m:1 origgvar using foo
#delimit ;
qplot marks, by(subject2, row(1) compact legend(off) note(""))
addplot(rspike p25 min where, pstyle(p2)
|| rspike p75 max where, pstyle(p2)
|| rbar p25 p75 where, barw(0.16) fcolor(none) pstyle(p2)
|| scatter p50 where, ms(Dh) msize(medlarge) pstyle(p2))
ytitle(Mathematics marks)
xla(0 0.25 "0.25" 0.5 "0.5" 0.75 "0.75" 1) xtitle(Fraction of data)
name(QB4, replace);
#delimit cr

Mardia, K. V., J. T. Kent and J. M. Bibby. 1979. Multivariate Analysis. London: Academic Press.

Mardia, K. V., J. T. Kent and C. C. Taylor. 2024. Multivariate Analysis. Hoboken, NJ: John Wiley.

Note that pctilesets allows many other choices. For example, plotting whiskers stretching to say 5% and 95% points, or 10% and 90% points, is to my mind a simpler alternative to the Tukey convention (and an older idea, stretching back to Bowley (and Galton before him)).
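As a sketch of that alternative, the first example can be adapted so that whiskers stop at the 5% and 95% points rather than the extremes. This assumes only what the posted examples already use: qplot installed from the Stata Journal, the p() and over() options as shown earlier, and result variables named p5 and p95 by the same pattern as p25 and p75. The graph name QB5 is arbitrary.

```stata
sysuse auto, clear
* ask for the 5% and 95% points instead of min and max
pctilesets mpg, p(5 25 50 75 95) over(foreign) saving(foo, replace)
clonevar origgvar = foreign
merge m:1 origgvar using foo
gen where = 1.1
#delimit ;
qplot mpg, ms(O) by(foreign, note("") legend(off))
    addplot(scatter p50 where, ms(Dh) msize(medlarge) pstyle(p2)
        xla(0 1 0.25 "0.25" 0.5 "0.5" 0.75 "0.75") xtitle(Fraction of data)
    || rbar p75 p25 where, fcolor(none) barw(0.12) pstyle(p2)
    || rspike p25 p5 where, pstyle(p2)
    || rspike p75 p95 where, pstyle(p2))
    name(QB5, replace) ;
#delimit cr
```

Because the quantile plot alongside still shows every data point, nothing is lost by stopping the whiskers short of the extremes.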

Anyone interested -- except those behind fearsome firewalls -- can naturally download the code and look at the help file, noting that an ancillary file contains all the code for the examples.

A sequel command for sets of moment-based measures is already done and will be posted shortly.

The small theme of exploiting addplot() options was also flagged recently in https://www.statalist.org/forums/for...addplot-option

Nick Cox

#2

11 Nov 2024, 10:26

momentsets is now available from SSC as promised in #1, thanks as ever to Kit Baum. Stata 8.1 is needed. (Here and above, the sense is that I am not aware that either command needs a later version, but I haven't tested the code on Stata 8.1.)

momentsets computes moment-based measures and collects them into
datasets. It is a wrapper for summarize. Typically the user
will specify one or more of the options mean, sd, var, skewness
or kurtosis. There are two syntaxes, for variables and for
groups of observations. momentsets by default lists its results.
Although saving to a permanent dataset is optional, that is the
intended key to many useful applications, graphical and
otherwise.

I will not announce it separately. The point is simple and should make sense if you have studied #1. This little project is not yet complete: the mystery D'Artagnan to join the Three Musketeers cisets, pctilesets and momentsets is firmly in mind, but not yet in sight.

Backing up, I strongly wanted what became cisets; then it became clear that a parallel command for selected quantiles would be something I would use much; finally a similar command for moment-based measures seemed like a complement.

Uses might be preparing the ground for a plot of SD or variance against mean, or of kurtosis against skewness.
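To make that concrete without guessing at the names of the variables momentsets saves, which are not spelled out in this thread, here is a stand-in using only official commands. collapse plays the role of momentsets here, and the auto data and rep78 grouping are chosen purely for illustration.

```stata
sysuse auto, clear
* one row per repair-record category, holding mean and SD of mpg
collapse (mean) mean=mpg (sd) sd=mpg if !missing(rep78), by(rep78)
scatter sd mean, mlabel(rep78) xtitle(Mean of mpg) ytitle(SD of mpg)
```

A plot like this is a quick check on whether spread rises with level, which in turn bears on whether a transformation might help.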
Chen Samulsion

#3

11 Nov 2024, 16:45

As a non-native English speaker, I am often confused about percentile, quantile, quartile and quintile, and about Stata commands such as pctile, xtile and centile and user-written commands such as quantiles and sumdist. I will come back to Nick's post here when I encounter questions in this regard.
https://www.statalist.org/forums/for...nd-percentiles
https://stats.stackexchange.com/questions/156778/percentile-vs-quantile-vs-quartile
https://statisticalpoint.com/percent...e-vs-quantile/
Nick Cox

#4

11 Nov 2024, 17:00

Yet another reference https://stats.stackexchange.com/ques...half-a-percent

The thread title isn't promising and my answer starts by trying to respect the question, but later edits bring it closer to #3.

Quartiles. There are 3 of them and they divide the data into 4 groups, ideally of equal frequency.

Quintiles. There are 4 of them and they divide the data into 5 groups, ideally of equal frequency.

Percentiles. There are 99 of them and they divide the data into 100 groups, ideally of equal frequency. However, I doubt many people would object to e.g. the 97.5 % point as also being a percentile. This definition doesn't commit you to calculating or using them all.

Quantile. The general term for any measure defined by some fraction being smaller and the complementary fraction being larger.

These are just the words. The recipes for calculation vary, depending on

1. sample size being a multiple of the number of quantiles, or not

2. what to do about ties and what to do about less than/equal to/more than

3. what to do about weights

and yet other details.
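The words and the recipes can be tried out side by side with official commands. A small sketch on the auto data, chosen only for illustration; the variable name quint is arbitrary.

```stata
sysuse auto, clear
* quartiles: 3 cut points dividing the data into 4 groups
_pctile mpg, nquantiles(4)
display r(r1) "  " r(r2) "  " r(r3)
* quintiles: 4 cut points dividing the data into 5 groups
_pctile mpg, nquantiles(5)
display r(r1) "  " r(r2) "  " r(r3) "  " r(r4)
* same percentiles under the alternative definition offered by _pctile
_pctile mpg, percentiles(25 50 75) altdef
display r(r1) "  " r(r2) "  " r(r3)
* centile reports percentiles together with confidence intervals
centile mpg, centile(25 50 75)
* bins of ideally equal frequency; ties in mpg make the counts unequal
xtile quint = mpg, nquantiles(5)
tabulate quint
```

The tabulate output is the useful reminder here: with heavily tied data, "ideally of equal frequency" can be some way from what you get.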