Binary outcome, categorical predictor(s): plotting counts and rates

Nick Cox

Join Date: Mar 2014

Posts: 35208
#1

Binary outcome, categorical predictor(s): plotting counts and rates

23 Nov 2024, 13:02

This thread grows out of starting to read

Maindonald, J.H., Braun, W.J. and Andrews, J.L. 2024. A Practical Guide to Data Analysis Using R: An Example-Based Approach. Cambridge: Cambridge University Press.
https://www.cambridge.org/core/books...alysis-using-r

If you don't know it, you may know an earlier book by the first two authors which went through three editions between 2003 and 2010. I liked that book because it was a good sampling of various topics in statistics, with a practical and modern take on strategy and style.

This is intended as the first of three posts, the split into separate posts being driven by Statalist rules on number of images in a post and -- more capriciously -- by my other commitments today.

You shouldn't read too much -- indeed anything -- into my reading a book that uses R. I am interested in good statistics anywhere and in borrowing good ideas for my own work using Stata.

On pp.20 and 89 there is a plot I don't recall seeing before. It's not included in their previous book.

The context is a binary outcome, say scored 0 and 1 so that the mean is the proportion of the state coded 1, which naturally may be presented as a percent rate if that is congenial.

In addition you have at least one categorical predictor, although life becomes more interesting with two categorical predictors.

Let's warm up with foreign and rep78 from the auto data, if only because Stata users have easy access to that dataset and because many readers will be long familiar with that dataset.

For the purposes of this thread I am regarding foreign as the outcome and rep78 as a predictor. That doesn't have to be convincing; I just want a sandbox for graphics, and better examples are coming right up in later posts.

This isn't an exact translation of what these authors -- I will call them MBA -- do in R, but it's I think identical in spirit.

Code:

sysuse auto, clear egen percent = mean(100 * foreign) if rep78 < ., by(rep78) label var percent "% foreign" bysort rep78 : gen count = _N twoway dropline count percent, subtitle("% foreign given repair record" " ", placement(w)) /// || scatter count percent, ms(none) mlabel(rep78) mlabsize(*2) legend(off)

So we are summarizing a 2 x 5 contingency table by

5 marginal frequencies by predictor

5 mean outcomes

The simple but crucial point is: Don't just look at variations in outcome. See what subsample size underlies them.

Now I don't think there is anything there that isn't as clear or clearer otherwise in a table or some simple bar chart. But I hope you agree that we're showing a relationship or association -- % foreign increases from 0 for repair record 1 and 2 to over 80% for repair record 5. Promise: the design becomes more useful with more predictors.

A detailed difference from MBA is that I use twoway dropline whereas the equivalent of what they do would be twoway spike.

There are at least two reasons for tending to use dropline; one of which you can see: two spikes have the same horizontal position and so we need to see them as distinctly as possible.

If you're wondering why the outcome is plotted on the horizontal axis, I agree with that wondering, and comments will appear in later posts.

If you're preferring labels 0(25)199 for the horizontal axis, I often agree with that too.
Tags: None

1 like
Clyde Schechter

Join Date: Apr 2014

Posts: 29792
#2

23 Nov 2024, 14:46

If you're preferring labels 0(25)199 for the horizontal axis, I often agree with that too.

Did you mean 0(25)100? I don't get why you would want labels beyond 100 here.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35208
#3

23 Nov 2024, 14:50

Yes; that’s a silly typo. More coming soon.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35208
#4

23 Nov 2024, 17:02

On seeing the example on p.20 of the book discussed in #1 I thought this is a design to try out with the Berkeley admissions data. And sure enough on p.89 the authors get to that.

To back up for anyone new or fuzzy about this dataset: The story goes back to the mid-1970s and concern that overall admission rate of females to various majors at Berkeley was markedly below that of males. The original paper in Science magazine in 1975 didn't list the data usually discussed more recently in books and papers. The version here is given in any edition of Alan Agresti's Categorical Data Analysis text from John Wiley.

So the outcome is (admitted, rejected) and the predictors are majors A B C D E F and male or female. The data arrive aggregated in frequencies.

Code:

* Example generated by -dataex-. For more info, type help dataex clear input byte(major female) float admitted int frequency 1 0 0 313 1 0 1 512 1 1 0 19 1 1 1 89 2 0 0 207 2 0 1 353 2 1 0 8 2 1 1 17 3 0 0 205 3 0 1 120 3 1 0 391 3 1 1 202 4 0 0 279 4 0 1 138 4 1 0 244 4 1 1 131 5 0 0 138 5 0 1 53 5 1 0 299 5 1 1 94 6 0 0 351 6 0 1 22 6 1 0 317 6 1 1 24 end label values major Dept label def Dept 1 "A", modify label def Dept 2 "B", modify label def Dept 3 "C", modify label def Dept 4 "D", modify label def Dept 5 "E", modify label def Dept 6 "F", modify label values female female label def female 0 "male", modify label def female 1 "female", modify label values admitted admitted label def admitted 0 "rejected", modify label def admitted 1 "admitted", modify tab female admitted

Some preliminary calculations lead to a graph similar to that presented by authors MBA.

Code:

bysort major female (admitted) : egen total = total(frequency) by major female : gen success = 100 * frequency[2] / total label var success "% admitted" label var total "# applications" separate total, by(female) veryshortlabel twoway dropline total? success if admitted || scatter total? success if admitted, /// ms(none ..) mlabel(major major) mlabpos(12 ..) mlabsize(medium ..) mlabc(stc1 stc2) /// note(Berkeley admissions data) /// legend(order(1 2) row(1) pos(12)) xla(0(25)100) ytitle(# applications)

The graph has to be read major by major. For majors A, B, D, F females are admitted at higher rates than males, while for majors C. E it's the other way round. The appearance of bias against females is a side-effect of differences in frequencies, starting with the fact that females apply in much smaller numbers than males to the majors with highest admission rates. The contradiction between aggregated and disaggregated patterns is an amalgamation paradox often named Simpson's paradox. E.H. SImpson certainly wrote about it, as did G,U. Yule and Karl Pearson before him. What the paradox is called is an example of Stigler's Law to the effect that things get named after people who weren't the first inventors or discoverers (and sure enough, Stigler knew that he wasn't the first to point that out, so the name is self-exemplifying.)

The pattern and its explanation have been well known and were the main story in the original paper from 1975. The question remaining is how best to show it and this graph is a competitor.

As a variation on the standard, I plotted results separately by major.

Code:

twoway dropline total? success if admitted, /// by(major, note(Berkeley admissions data) legend(pos(12))) /// legend(row(1)) ytitle(# applications) xla(0(25)100)

As a final twist, I followed the logic, or rather the convention, that outcome should be plotted on the vertical axis.

Code:

twoway dropline success total if !female || dropline success total if female, /// by(major, note(Berkeley admissions data) legend(pos(12))) /// legend(order(1 "male" 2 "female") row(1)) yla(0(25)100)

I don't think it is so clear, which may well be why MBA plot it the way they do.

These data have been plotted in various ways, such as using mosaic plots. Another graphical analysis was included in my talk at the September Stata meeting in London. See slides 53 and 54 of my presentation under https://www.stata.com/meeting/uk24/

A small puzzle remains what to call this graph, if indeed it really needs a name. MBA plot runs the risk of exemplifying Stigler's Law and is a name all too likely to be misunderstood for other reasons.

Next posting tomorrow, I hope.

Last edited by Nick Cox; 23 Nov 2024, 17:04.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35208
#5

24 Nov 2024, 06:44

Proper references for #4

Bickel, P. J., E. A. Hammel, and J. W. O’Connell. 1975. Sex Bias in Graduate Admissions: Data from Berkeley. Science 187: 398–404. http://www.jstor.org/stable/1739581.

Agresti, A. 2013. Categorical Data Analysis. Hoboken, NJ: John Wiley. p.63

Here is another example with data from

Appleton, D. R., J. M. French, and M. P. J. Vanderpump. 1996. Ignoring a Covariate: An Example of Simpson’s Paradox. The American Statistician 50: 340–41.
https://doi.org/10.2307/2684931. https://www.jstor.org/stable/2684931

The story is of women of various ages from Whickham, near Newcastle in the north-east of England, classified by present (tobacco) smoker or never smoked, according to whether they had died 20 years after a survey. (Women who had smoked but given up are excluded.)

An overall analysis suggests that smoking leads to higher rates of survival, but as before you need to look at each age group,

Code:

* Example generated by -dataex-. For more info, type help dataex clear input float(age smoker) byte dead float freq 1 1 1 2 1 0 1 1 1 1 0 53 1 0 0 61 2 1 1 3 2 0 1 5 2 1 0 121 2 0 0 152 3 1 1 14 3 0 1 7 3 1 0 95 3 0 0 114 4 1 1 27 4 0 1 12 4 1 0 103 4 0 0 66 5 1 1 51 5 0 1 40 5 1 0 64 5 0 0 81 6 1 1 29 6 0 1 101 6 1 0 7 6 0 0 28 7 1 1 13 7 0 1 64 7 1 0 0 7 0 0 0 end label values age age label def age 1 "18-24", modify label def age 2 "25-34", modify label def age 3 "35-44", modify label def age 4 "45-54", modify label def age 5 "55-64", modify label def age 6 "65-74", modify label def age 7 "75+", modify label values smoker smoker label def smoker 0 "non-smoker", modify label def smoker 1 "smoker", modify label values dead dead label def dead 0 "alive", modify label def dead 1 "dead", modify bysort age smoker (dead) : egen total = total(freq) by age smoker : gen failure = 100 * freq[2]/total label var failure "% dead" label var total "# women" separate total, by(smoker) veryshortlabel set scheme stcolor twoway spike total? failure if dead || scatter total? failure if dead, /// ms(none ..) mlabel(age age) mlabpos(12 ..) mlabsize(medium ..) mlabc(stc1 stc2) /// note(Whickham data) /// legend(order(1 2) row(1) pos(12)) xla(0(25)100) ytitle(# women) name(App0, replace) twoway spike total? failure if dead, /// by(age, row(1) compact note(Whickham data) legend(pos(12))) /// legend(row(1)) ytitle(# women) xla(0(25)100) name(App1, replace) twoway dropline failure total if !smoker || dropline failure total if smoker, /// by(age, row(1) compact note(Whickham data) legend(pos(12))) /// legend(order(1 "non-smoker" 2 "smoker") row(1)) yla(0(25)100) name(App2, replace)

I mix spike and dropline styles a little mischievously to make any interested reader (and myself) wonder what works best.

Each graph is a little challenging but each has positive value. The last makes the simple point that most of the women 18-24 were still alive 20 years later while none of the women 75+ were. Broadly survival declines with age. So far, so unsurprising, but in each graph the detail is in comparing smokers and non-smokers, which does come out as one would expect.

Transforming the % dead scale is another option.

Naturally no serious medical statistics or epidemiology project would focus on these variables alone. My interest is in which graphs best show the patterns -- and indeed if anyone has seen this design before.
Comment

Announcement

Binary outcome, categorical predictor(s): plotting counts and rates

Comment

Comment

Comment

Comment