This thread grows out of starting to read
Maindonald, J.H., Braun, W.J. and Andrews, J.L. 2024. A Practical Guide to Data Analysis Using R: An Example-Based Approach. Cambridge: Cambridge University Press.
https://www.cambridge.org/core/books...alysis-using-r
If you don't know it, you may know an earlier book by the first two authors which went through three editions between 2003 and 2010. I liked that book because it was a good sampling of various topics in statistics, with a practical and modern take on strategy and style.
This is intended as the first of three posts, the split into separate posts being driven by Statalist rules on number of images in a post and -- more capriciously -- by my other commitments today.
You shouldn't read too much -- indeed anything -- into my reading a book that uses R. I am interested in good statistics anywhere and in borrowing good ideas for my own work using Stata.
On pp.20 and 89 there is a plot I don't recall seeing before. It's not included in their previous book.
The context is a binary outcome, say scored 0 and 1 so that the mean is the proportion of the state coded 1, which naturally may be presented as a percent rate if that is congenial.
In addition you have at least one categorical predictor, although life becomes more interesting with two categorical predictors.
Let's warm up with foreign and rep78 from the auto data, if only because Stata users have easy access to that dataset and many readers will be long familiar with it.
For the purposes of this thread I am regarding foreign as the outcome and rep78 as a predictor. That doesn't have to be convincing; I just want a sandbox for graphics, and better examples are coming right up in later posts.
This isn't an exact translation of what these authors -- I will call them MBA -- do in R, but it is, I think, identical in spirit.
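Code:
sysuse auto, clear

* % foreign within each repair record category (missing rep78 excluded)
egen percent = mean(100 * foreign) if rep78 < ., by(rep78)
label var percent "% foreign"

* number of cars in each repair record category
bysort rep78 : gen count = _N

* droplines show count against % foreign; marker labels show repair record
twoway dropline count percent, subtitle("% foreign given repair record" " ", placement(w)) ///
|| scatter count percent, ms(none) mlabel(rep78) mlabsize(*2) legend(off)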
So we are summarizing a 2 x 5 contingency table by
5 marginal frequencies by predictor
5 mean outcomes
The simple but crucial point is: Don't just look at variations in outcome. See what subsample size underlies them.
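If you want to check the numbers behind the plot, a minimal sketch along these lines should do it. The variable name pcforeign is just an illustrative choice, and tabulate and tabstat are only one of several ways to get the same summary.

```stata
sysuse auto, clear

* the 2 x 5 table itself, with row percents (rep78 in rows, foreign in columns)
tabulate rep78 foreign, row

* the same information as 5 frequencies and 5 mean outcomes
* pcforeign is a hypothetical name for the outcome scaled to percent
generate pcforeign = 100 * foreign
tabstat pcforeign, by(rep78) statistics(n mean) format(%4.1f)
```

Note that tabstat with by() leaves out observations with missing rep78 unless you ask otherwise, which matches the if rep78 < . condition used for the plot.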
Now I don't think there is anything there that isn't as clear or clearer otherwise in a table or some simple bar chart. But I hope you agree that we're showing a relationship or association -- % foreign increases from 0 for repair record 1 and 2 to over 80% for repair record 5. Promise: the design becomes more useful with more predictors.
A detailed difference from MBA is that I use twoway dropline whereas the equivalent of what they do would be twoway spike.
There are at least two reasons for tending to use dropline, one of which you can see: two spikes have the same horizontal position and so we need to see them as distinctly as possible.
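For comparison, here is a sketch of the spike version. It is an assumed Stata counterpart of the MBA design, not a translation of their R code, and it reuses the same percent and count variables as the dropline plot.

```stata
sysuse auto, clear
egen percent = mean(100 * foreign) if rep78 < ., by(rep78)
label var percent "% foreign"
bysort rep78 : gen count = _N

* same design, but spikes carry no marker at the data point
twoway spike count percent, subtitle("% foreign given repair record" " ", placement(w)) ///
|| scatter count percent, ms(none) mlabel(rep78) mlabsize(*2) legend(off)
```

Repair records 1 and 2 both sit at 0% foreign, so their spikes are drawn on top of each other and only the marker labels tell them apart.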
If you're wondering why the outcome is plotted on the horizontal axis, I agree with that wondering, and comments will appear in later posts.
If you're preferring labels 0(25)100 for the horizontal axis, I often agree with that too.
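Labels at steps of 25 from 0 to 100 are one extra option away. A minimal sketch, assuming the same variables as the dropline plot earlier:

```stata
sysuse auto, clear
egen percent = mean(100 * foreign) if rep78 < ., by(rep78)
label var percent "% foreign"
bysort rep78 : gen count = _N

* xlabel(0(25)100) asks for axis labels at 0, 25, 50, 75 and 100
twoway dropline count percent, subtitle("% foreign given repair record" " ", placement(w)) ///
|| scatter count percent, ms(none) mlabel(rep78) mlabsize(*2) legend(off) xlabel(0(25)100)
```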