As promised in https://www.statalist.org/forums/for...-interval-sets a new command pctilesets is now downloadable from SSC.
Stata 8.1 is required.
Here is the executive summary (edited very slightly) from
Code:
ssc desc pctilesets
pctilesets computes percentile or quantile sets. It is a
wrapper for summarize. Typically the user will specify one or
more of the options minimum, maximum or pctile() to select
extremes and/or any or all of the percentiles for 1, 5, 10, 25
(lower quartile), 50 (median), 75 (upper quartile), 90, 95 and
99% cumulative probability. There are two syntaxes, for
variables and for groups of observations. pctilesets by default
lists its results. Although saving to a permanent dataset is
optional, that is the intended key to many useful applications,
graphical and otherwise.
A percentile set is a temporary dataset containing some, or occasionally all, of the following variables.
* varname is a string variable holding the name or names of the variable(s) being summarized.
* varlabel is a string variable holding the variable label of each variable being summarized. If no variable label has been defined, the value is instead the variable name.
* (Groups syntax only) origgvar is a numeric or string variable as specified in the over() option.
* (Groups syntax only) groupvar is a string variable holding the name of the group variable specified in the over() option.
* (Groups syntax only) gvarlabel is a string variable holding the variable label of the group variable groupvar specified in the over() option. If no variable label has been defined, the value is instead the variable name.
* (Groups syntax only) group is a numeric variable with value labels describing each distinct value of groupvar. Each such variable has integer values 1 up and value labels derived from the variable specified.
* n is a numeric variable holding the number of observations used in the estimate.
* Any or all of min, p1, p5, p10, p25, p50, p75, p90, p95, p99 or max holding results for the measure concerned.
(Detail: As first posted, the help file contains a typo in the explanation above that will be corrected in due course.)
* weights is a string variable appearing if (and only if) weights were specified as a record of such use.
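To see which of these variables turn up in practice, a minimal sketch is to save a set and then open it. The call below copies the groups syntax used in the examples in this post; the filename foo is arbitrary, and the exact variable list you see depends on the options you specify.

```stata
* sketch only: assumes pctilesets has been installed from SSC
sysuse auto, clear
pctilesets mpg, p(25 50 75) min max over(foreign) saving(foo, replace)

* open the saved percentile set and inspect it
use foo, clear
describe
list
```

With over(foreign) there should be one observation for each group, which is what makes the later merge m:1 on origgvar possible.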
Examples may well serve better than any amount of exegesis. Those here stress graphical applications, which I have most in mind for my own use, but other uses may well occur to you.
In this first example, I use qplot from the Stata Journal for quantile plots and, in the spirit if not quite the letter of Emanuel Parzen (https://www.jstor.org/stable/2286734), present hybrid quantile-box plots that aim to combine the informative detail of quantile plots with the skeletal summary of box plots. The detail exploited is that each cumulative probability scale on the x axis is known to stretch from 0 to 1, so 1.1 is a safe place to put the box plots. Naturally there are other ways to do that: you could call up egen repeatedly to put the five-number summaries (extremes, quartiles, median) in new variables. With this design there is no need for arbitrary criteria on what to show beyond the quartiles. (Tukey's rule of thumb to plot points individually if and only if they lie at least 1.5 IQR from the nearer quartile seems to remain the most common convention, but there is all too much evidence that it is, variously, not explained at all by authors who use it, or poorly explained, or even incorrectly explained. It may safely be guessed that readers are often not well placed to know better.)
Code:
sysuse auto, clear
pctilesets mpg, p(25 50 75) min max over(foreign) saving(foo, replace)
clonevar origgvar=foreign
merge m:1 origgvar using foo
gen where = 1.1
#delimit ;
qplot mpg, ms(O) by(foreign, note("") legend(off))
addplot(scatter p50 where, ms(Dh) msize(medlarge) pstyle(p2)
    xla(0 1 0.25 "0.25" 0.5 "0.5" 0.75 "0.75") xtitle(Fraction of data)
|| rbar p75 p25 where, fcolor(none) barw(0.12) pstyle(p2)
|| rspike p25 min where, pstyle(p2)
|| rspike p75 max where, pstyle(p2))
name(QB1, replace);
#delimit cr
As a twist on that, let's use the normal distribution as reference. That doesn't imply an attitude that the data are, or should be, normally distributed, any more than the default quantile plot implies that the data are, or should be, uniformly distributed.
In practice you may need two passes, or a small calculation on the side, to work out a good location for the box plots.
Code:
replace where = 2.7
#delimit ;
qplot mpg, ms(O) by(foreign, note("") legend(off))
xla(-2/2) xtitle(Standard normal deviate)
addplot(scatter p50 where, ms(Dh) msize(medlarge) pstyle(p2)
|| rbar p75 p25 where, fcolor(none) barw(0.44) pstyle(p2)
|| rspike p25 min where, pstyle(p2)
|| rspike p75 max where, pstyle(p2))
trscale(invnormal(@)) name(QB2, replace);
#delimit cr
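The "small calculation on the side" might look like the sketch below. It assumes that qplot is using plotting positions (i - 0.5)/n, which I understand to be its default, so the largest position in a group of n values is (n - 0.5)/n. If you specify a different a() the formula changes accordingly. The variable names n and zmax are just for illustration.

```stata
sysuse auto, clear

* number of nonmissing mpg values in each group
bysort foreign: egen n = count(mpg)

* largest plotting position on the standard normal scale,
* ASSUMING plotting positions (i - 0.5)/n
gen double zmax = invnormal((n - 0.5) / n)

* the box plots need to sit to the right of this
summarize zmax, meanonly
display "largest normal deviate plotted: " %4.2f r(max)
```

The box location should exceed that maximum by at least half the bar width, plus a little breathing space.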
As yet another twist, we revert to Parzen's original idea that the quantile plot and the box plot can share exactly the same space: not only are median and quartiles levels that can be shown on the quantile axis, so also 0.25 0.5 0.75 are levels that can be shown on the probability axis.
I like the mantra this layout suggests: if half the data points -- those between the quartiles -- lie inside each box, then as part of the same idea half the data points must lie outside each box -- and often they are the more interesting or important half!
However, once that point is seen, I tend to move back to showing quantile and box plots side by side.
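One way the same-space layout might be drawn is sketched here. This is a guess at a workable recipe, not the code behind the original graph. The trick assumed is that a bar centred at 0.5 with width 0.5 spans 0.25 to 0.75 on the probability axis, while rbar makes it span the quartiles on the quantile axis. The names foo3, half and QB3sketch are arbitrary.

```stata
sysuse auto, clear
pctilesets mpg, p(25 50 75) over(foreign) saving(foo3, replace)
clonevar origgvar = foreign
merge m:1 origgvar using foo3

* centre of the box on the probability axis
gen half = 0.5

#delimit ;
qplot mpg, ms(O) by(foreign, note("") legend(off))
addplot(rbar p25 p75 half, barw(0.5) fcolor(none) pstyle(p2)
|| scatter p50 half, ms(Dh) msize(medlarge) pstyle(p2))
xla(0 0.25 "0.25" 0.5 "0.5" 0.75 "0.75" 1) xtitle(Fraction of data)
name(QB3sketch, replace);
#delimit cr
```

The quantile trace should pass through the corners of each box at 0.25 and 0.75, at least approximately, given that summarize and qplot need not use exactly the same rule for quantiles.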
This last example uses a different dataset, from Mardia et al. (1979, 2024). The appearance of a second edition 45 years after the first should inspire technical authors everywhere.
Here, as often, we do need a different layout for best presentation, and we should want to avoid the dopeyness of alphabetical order (algebra first!). myaxis from the Stata Journal helps there.
Code:
use https://www.stata-journal.com/software/sj20-2/pr0046_1/mathsmarks, clear
rename * (marks*)
gen id = _n
reshape long marks, i(id) j(subject) string
myaxis subject2=subject, sort(median marks)
qplot marks, by(subject2, row(1) compact) ytitle(Mathematics marks)
pctilesets marks, over(subject2) min max p(25 50 75) saving(foo, replace)
gen where = 1.1
clonevar origgvar=subject2
merge m:1 origgvar using foo
#delimit ;
qplot marks, by(subject2, row(1) compact legend(off) note(""))
addplot(rspike p25 min where, pstyle(p2)
|| rspike p75 max where, pstyle(p2)
|| rbar p25 p75 where, barw(0.16) fcolor(none) pstyle(p2)
|| scatter p50 where, ms(Dh) msize(medlarge) pstyle(p2))
ytitle(Mathematics marks)
xla(0 0.25 "0.25" 0.5 "0.5" 0.75 "0.75" 1) xtitle(Fraction of data)
name(QB4, replace);
#delimit cr
Mardia, K. V., J. T. Kent and J. M. Bibby. 1979. Multivariate Analysis. London: Academic Press.
Mardia, K. V., J. T. Kent and C. C. Taylor. 2024. Multivariate Analysis. Hoboken, NJ: John Wiley.
Note that pctilesets allows many other choices. For example, plotting whiskers stretching to say 5% and 95% points, or 10% and 90% points, is to my mind a simpler alternative to the Tukey convention (and an older idea, stretching back to Bowley (and Galton before him)).
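As a sketch of that alternative, the first example can be adapted so that whiskers stop at the 5% and 95% points. Only the pctile() list and the rspike calls change. The names foo5 and QB5 are arbitrary, and I am assuming that the results arrive as p5 and p95, following the naming listed above.

```stata
sysuse auto, clear
pctilesets mpg, p(5 25 50 75 95) over(foreign) saving(foo5, replace)
clonevar origgvar = foreign
merge m:1 origgvar using foo5
gen where = 1.1

#delimit ;
qplot mpg, ms(O) by(foreign, note("") legend(off))
addplot(scatter p50 where, ms(Dh) msize(medlarge) pstyle(p2)
|| rbar p75 p25 where, fcolor(none) barw(0.12) pstyle(p2)
|| rspike p25 p5 where, pstyle(p2)
|| rspike p75 p95 where, pstyle(p2))
xla(0 0.25 "0.25" 0.5 "0.5" 0.75 "0.75" 1) xtitle(Fraction of data)
name(QB5, replace);
#delimit cr
```

Because the quantile plot alongside shows every data point, nothing in the tails is hidden by the shorter whiskers.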
Anyone interested -- except those behind fearsome firewalls -- can naturally download the code and look at the help file, noting that an ancillary file contains all the code for the examples.
A sequel command for sets of moment-based measures is already done and will be posted shortly.
The small theme of exploiting addplot() options was also flagged recently in https://www.statalist.org/forums/for...addplot-option