myrank downloadable from SSC

Nick Cox


11 Feb 2025, 09:47

Thanks as always to Kit Baum, a new command myrank is now downloadable from SSC. Stata 8.2 is required.

In a nutshell, it's a small utility or helper command to create a variable for one axis of a very specific graph -- but a graph that can be helpful.

To see the small point here, let's look at a typical application of the command.

Code:

. sysuse auto, clear
(1978 automobile data)

. myrank rank=mpg, over(foreign) gap(2)

. scatter mpg rank, xla(`mid1' "Domestic" `mid2' "Foreign", tlc(none)) xli(`gap1', lp(solid)) xtitle("")

The result is a side-by-side quantile plot. You may know other terms: these plots, or relatives of them, have been called rank-size plots; named for G.K. Zipf, Robert Whittaker and Jan Pen; or dubbed value, hypsometric or flow duration curves. (Yet other terms to add to the menagerie would be welcome.)

Whatever the name, they are compact displays of frequency distributions, which show level, spread, skewness, gaps, spikes and outliers quite simply, without arbitrary or even reasoned choices of smoothing, binning, jittering, or which points in the tail(s) to show as such.

The y axis variable is clear enough -- here mpg -- and in passing let's flag that logarithmic or other scales might help in many applications.

You know you've been using Stata a great deal if these data are so familiar that you remember that there are 52 domestic cars and 22 foreign cars.

The x axis variable is a kind of rank. We do in fact have ranks 1 to 52 for 52 domestic cars, but then we plot mpg against 55 (not 53) to 76 (similarly, not 74) for the foreign cars. The gap is cosmetic and a desired side-effect of the option choice gap(2) in the call to myrank. Here the gap allows a separating line to be plotted, and in other cases you might want to plot other stuff in the gap, even a compact box plot. But in my design the reader never sees the axis labels and above all is not expected to decode 55 to 76 as happening to be 52 + gap of 2 + (1 to 22). Instead the graph code exploits a little feature: myrank leaves in its wake various local macros telling you the positions of the middles of each group (so that you can place suitable descriptive text) and of the gaps (so that you can plot separating lines, or whatever else).

If you wanted a design like the above, you could work out the pesky details of what the ranks should be, but it's not much fun doing that once, and even less fun doing it repeatedly.
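To make those pesky details concrete, here is a minimal by-hand sketch for the two-group example. It is not the code inside myrank: the names handrank, offset, lastdom, mymid1, mymid2 and mygap1 are all hypothetical, the positions chosen for label text and the separating line are just one reasonable choice, and ties on mpg are broken arbitrarily here, which may or may not match what myrank does.

```stata
* By-hand sketch only; every name created here is hypothetical,
* and nothing below is taken from the myrank source code.
sysuse auto, clear

* rank within each group, in order of mpg (ties broken arbitrarily)
bysort foreign (mpg): gen handrank = _n

* shift the second group by the size of the first group plus a gap of 2
count if foreign == 0
local offset = r(N) + 2
replace handrank = handrank + `offset' if foreign == 1

* positions for label text (group middles) and for a separating line
summarize handrank if foreign == 0, meanonly
local mymid1 = (r(min) + r(max)) / 2
local lastdom = r(max)
summarize handrank if foreign == 1, meanonly
local mymid2 = (r(min) + r(max)) / 2
local mygap1 = (`lastdom' + r(min)) / 2

scatter mpg handrank, xla(`mymid1' "Domestic" `mymid2' "Foreign", tlc(none)) ///
    xli(`mygap1', lp(solid)) xtitle("")
```

Because the sketch relies on local macros and on /// line continuation, it needs to be run in one go from a do-file. Even for two groups that is a fair amount of bookkeeping, and every extra group adds another offset, another middle and another gap.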

A natural question (for me to ask myself, even if it occurs to no-one else) is why don't I put it all together as a graph command? I may yet do that, but in this version the emphasis is on separating out the silly stuff, so that no-one is distracted by graphical choices that don't appeal, and everyone so minded can work out code for graphs that do appeal.

Another natural question is why you can't do this already, to which the answer is that you can, to some extent.

You can get quite close with graph dot and an undocumented option and some tiny tricks: details in the help for myrank.

Official command quantile is still very limited in scope, but qplot from the Stata Journal allows things like this:

Code:

qplot mpg, by(foreign, note("")) xtitle(Fraction of data) ms(O)

qplot allows other choices, too, not the issue at the moment.

The downside of this use of a by() option is awkwardness or inefficiency in showing panels of the same size for groups with different numbers of observations, and that is the rationale for the main idea. (It could be much worse if some groups are very small and others much larger: the help file gives code for an example with the auto data and repair record rep78, for which groups can be as small as 2.)

This design allows much scope for elaborations and variations. I'll leave marginal box plots as a hint, and show a simple annotation with means as one example. (Clearly, you could choose medians, geometric means, trimmed means, ..., as desired.)

Code:

. egen mean = mean(mpg), by(foreign)

. separate mean, by(foreign) veryshortlabel

Variable      Storage   Display    Value
    name         type    format    label      Variable label
--------------------------------------------------------------
mean0           float   %9.0g                 Domestic
mean1           float   %9.0g                 Foreign

. scatter mpg rank, xla(`mid1' "Domestic" `mid2' "Foreign", tlc(none)) xli(`gap1', lp(solid)) xtitle("") || line mean? rank, sort lc(black ..) legend(off) note(horizontal lines show means) ytitle("`: var label mpg'")
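As one of the variations just mentioned, the same pattern should carry over to medians with little change. This is only a sketch: it assumes myrank has been installed (ssc install myrank) and that everything runs in one go from a do-file, so that the local macros left behind by myrank are still defined when the graph is drawn. The variable name median is my own choice.

```stata
* Sketch only: assumes myrank is installed (ssc install myrank) and that
* this runs as one do-file, so the local macros from myrank are still defined.
sysuse auto, clear
myrank rank=mpg, over(foreign) gap(2)

* median is a hypothetical variable name; median() is an official egen function
egen median = median(mpg), by(foreign)
separate median, by(foreign) veryshortlabel

scatter mpg rank, xla(`mid1' "Domestic" `mid2' "Foreign", tlc(none)) ///
    xli(`gap1', lp(solid)) xtitle("")                                 ///
    || line median? rank, sort lc(black ..) legend(off)               ///
    note(horizontal lines show medians) ytitle("`: var label mpg'")
```

The wildcard median? picks up median0 and median1 but not median itself, just as mean? does in the code above.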

It can be salutary to recall that the quantile plot is a discrete representation of an underlying quantile function on support 0 to 1 and that the mean is the area under the curve over that support. However, we aren't showing zero on the y axis or making that explicit by shading an area.
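In symbols (the notation is mine, not from the help file): write Q(p) for the quantile function, y_(1) <= ... <= y_(n) for the ordered sample values, and Q-hat for the step function that takes the value y_(i) on the interval ((i-1)/n, i/n]. Then the statement about the mean is:

```latex
\mu = \int_0^1 Q(p)\,dp,
\qquad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_{(i)} = \int_0^1 \hat{Q}(p)\,dp .
```

Each ordered value contributes a strip of width 1/n and height y_(i), so the strips add up to the sample mean.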

For other variations on the same small theme, see https://journals.sagepub.com/doi/pdf...6867X241297949