Adding marginal spike histograms to quantile and cumulative distribution plots

Nick Cox

Join Date: Mar 2014

Posts: 35433
#1

Adding marginal spike histograms to quantile and cumulative distribution plots

21 Dec 2020, 12:40

Some intriguing examples in Frank Harrell's Regression Modeling Strategies (2015: Cham: Springer) led me to play with addition of marginal spike histograms to quantile and (empirical) cumulative distribution (function) plots.

The point is at least two-fold:

Pedagogic. Such additions help to make clear what (e.g.) a quantile plot means and how it can be explained as stacking values in terms of their associated cumulative probabilities. Every new kind of graph becomes comfortable only with increased familiarity, and I've encountered students and colleagues evidently long past their first histogram who struggle a little with quantile plots.

Practical. In a strong sense marginal histograms add no more information to a quantile plot. In practice, they offer a complementary view of each distribution, affording another way to think about distribution level, spread and shape -- and indeed fine structure too.

I'll phrase discussion here entirely in terms of official Stata commands, but note that the discussion extends to community-contributed (user-written) commands such as qplot or distplot (Stata Journal) -- where there is scope for programmers to make this easier by adding dedicated options.

Let's start with a common or garden quantile plot. Some small but I hope helpful details: I note that official command quantile adds a reference line which allows comparison with a uniform (rectangular, flat) distribution with the same range as the observed data. I almost never want that, so I usually make it invisible by changing its colour. I also vary the marker symbol and vertical axis label angle.

Code:

sysuse auto, clear set scheme s1color quantile mpg, rlopts(lc(none)) ms(oh) yla(, ang(h)) name(G1, replace)

The name quantile plot goes back at least to

Wilk, M.B. and Gnanadesikan, R. 1968. Probability plotting methods for the analysis of data. Biometrika 55: 1--17.

although the idea is much older, being used several times in the 19th century. As implemented in quantile the idea is to plot values in rank order versus cumulative probability, conventionally (unique rank - 0.5) / sample size.

As one axis is a probability scale, that is compatible in principle with a marginal histogram showing probabilities in each distinct bin. There are various fairly simple ways to get that directly. Here's one.

Code:

count if mpg < . egen prob = total(1/`r(N)'), by(mpg) egen tag = tag(mpg)

Counting the sample size explicitly does no harm and is needed if there are missing values (not the case here in this example, but often needed) and/or you want to look at a subset of your data (again not the case here, but often needed).

A first stab just adds spikes to the vertical axis

Code:

quantile mpg, rlopts(lc(none)) ms(oh) yla(, ang(h)) addplot(spike prob mpg if tag, horizontal) legend(off) name(G2, replace)

That is a good start, but perhaps we'd be better off flipping the histogram so that it's clear of the quantile plot.

Code:

gen nprob = -prob quantile mpg, rlopts(lc(none)) ms(oh) yla(, ang(h)) addplot(spike nprob mpg if tag, horizontal) legend(off) name(G3, replace)

The point about adding if tag is partly one of efficiency and partly one of avoiding monitor artefacts.

In this example, we were lucky: there are 21 distinct values, so the mean probability in each bin is about 0.05. The maximum probability may naturally be much higher but here taking probabilities literally (which means numerically) seems to work fine.

In other examples we might have do more work -- to make some decisions -- to get an extra histogram that is informative yet restrained.

Code:

webuse nlswork, clear egen tag = tag(ln_wage) count if ln_wage < . egen prob = total(1/`r(N)'), by(ln_wage) gen nprob = -prob quantile ln_wage, rlopts(lc(none)) ms(oh) legend(off) addplot(spike nprob ln_wage if tag, horizontal) yla(, ang(h)) name(G4, replace)

Here there are 8173 distinct values and the mean probability in each bin is about .0001. The maximum probability again can be much higher but here taking probabilities literally (which means numerically) means that the histogram degenerates to a rug plot. If you want a rug plot, you can always get one.

There are various ways to move forward. One is to decide how much space to give the histogram and to scale accordingly. Then the probability scale for the quantile plot and that for the quantile plot are different, which needs to be explained somewhere. Another is to use a square root scale for probabilities, which often works well to make structure clearer. The two can be combined.... .

Code:

su prob, meanonly gen nprob_scaled = nprob * (0.1 / r(max)) gen prob_root = -sqrt(prob) quantile ln_wage, rlopts(lc(none)) ms(oh) legend(off) addplot(spike nprob_scaled ln_wage if tag, horizontal) yla(, ang(h)) name(G5, replace) quantile ln_wage, rlopts(lc(none)) ms(oh) legend(off) addplot(spike prob_root ln_wage if tag, horizontal) yla(, ang(h)) name(G6, replace)

Scale histogram to cover 10% of horizontal extent of quantile plot:

Square root scale:
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 35433
#2

21 Dec 2020, 12:43

The last graph appears here, given a limit of 5 images per post.

Histograms using square root scales -- rootograms! -- go back at least to John W. Tukey circa 1965. More discussion and examples in the manual entry [R] spikeplot.
1 like
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35433
#3

05 Oct 2021, 08:23

Now written up in the Stata Journal and immediately accessible at https://journals.sagepub.com/doi/pdf...6867X211045583 (regardless of whether you or your workplace has a subscription).
1 like
Comment

Announcement

Adding marginal spike histograms to quantile and cumulative distribution plots

Comment

Comment