Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Binary outcome, categorical predictor(s): plotting counts and rates

    This thread grows out of starting to read

    Maindonald, J.H., Braun, W.J. and Andrews, J.L. 2024. A Practical Guide to Data Analysis Using R: An Example-Based Approach. Cambridge: Cambridge University Press.
    https://www.cambridge.org/core/books...alysis-using-r

    If you don't know it, you may know an earlier book by the first two authors which went through three editions between 2003 and 2010. I liked that book because it was a good sampling of various topics in statistics, with a practical and modern take on strategy and style.

    This is intended as the first of three posts, the split into separate posts being driven by Statalist rules on number of images in a post and -- more capriciously -- by my other commitments today.

    You shouldn't read too much -- indeed anything -- into my reading a book that uses R. I am interested in good statistics anywhere and in borrowing good ideas for my own work using Stata.

    On pp.20 and 89 there is a plot I don't recall seeing before. It's not included in their previous book.

    The context is a binary outcome, say scored 0 and 1 so that the mean is the proportion of the state coded 1, which naturally may be presented as a percent rate if that is congenial.

    In addition you have at least one categorical predictor, although life becomes more interesting with two categorical predictors.

    Let's warm up with foreign and rep78 from the auto data, if only because Stata users have easy access to that dataset and because many readers will be long familiar with that dataset.

    For the purposes of this thread I am regarding foreign as the outcome and rep78 as a predictor. That doesn't have to be convincing; I just want a sandbox for graphics, and better examples are coming right up in later posts.

    This isn't an exact translation of what these authors -- I will call them MBA -- do in R, but it's I think identical in spirit.

    Code:
    sysuse auto, clear
    egen percent = mean(100 * foreign) if rep78 < ., by(rep78)
    label var percent "% foreign"
    bysort rep78 : gen count = _N
    
    twoway dropline count percent, subtitle("% foreign given repair record" " ", placement(w)) ///
    || scatter count percent, ms(none) mlabel(rep78) mlabsize(*2) legend(off)

    Click image for larger version

Name:	MBA1.png
Views:	1
Size:	38.7 KB
ID:	1768147

    So we are summarizing a 2 x 5 contingency table by

    5 marginal frequencies by predictor

    5 mean outcomes

    The simple but crucial point is: Don't just look at variations in outcome. See what subsample size underlies them.

    Now I don't think there is anything there that isn't as clear or clearer otherwise in a table or some simple bar chart. But I hope you agree that we're showing a relationship or association -- % foreign increases from 0 for repair record 1 and 2 to over 80% for repair record 5. Promise: the design becomes more useful with more predictors.

    A detailed difference from MBA is that I use twoway dropline whereas the equivalent of what they do would be twoway spike.

    There are at least two reasons for tending to use dropline; one of which you can see: two spikes have the same horizontal position and so we need to see them as distinctly as possible.

    If you're wondering why the outcome is plotted on the horizontal axis, I agree with that wondering, and comments will appear in later posts.

    If you're preferring labels 0(25)199 for the horizontal axis, I often agree with that too.




  • #2
    If you're preferring labels 0(25)199 for the horizontal axis, I often agree with that too.
    Did you mean 0(25)100? I don't get why you would want labels beyond 100 here.

    Comment


    • #3
      Yes; that’s a silly typo. More coming soon.

      Comment


      • #4
        On seeing the example on p.20 of the book discussed in #1 I thought this is a design to try out with the Berkeley admissions data. And sure enough on p.89 the authors get to that.

        To back up for anyone new or fuzzy about this dataset: The story goes back to the mid-1970s and concern that overall admission rate of females to various majors at Berkeley was markedly below that of males. The original paper in Science magazine in 1975 didn't list the data usually discussed more recently in books and papers. The version here is given in any edition of Alan Agresti's Categorical Data Analysis text from John Wiley.

        So the outcome is (admitted, rejected) and the predictors are majors A B C D E F and male or female. The data arrive aggregated in frequencies.

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input byte(major female) float admitted int frequency
        1 0 0 313
        1 0 1 512
        1 1 0  19
        1 1 1  89
        2 0 0 207
        2 0 1 353
        2 1 0   8
        2 1 1  17
        3 0 0 205
        3 0 1 120
        3 1 0 391
        3 1 1 202
        4 0 0 279
        4 0 1 138
        4 1 0 244
        4 1 1 131
        5 0 0 138
        5 0 1  53
        5 1 0 299
        5 1 1  94
        6 0 0 351
        6 0 1  22
        6 1 0 317
        6 1 1  24
        end
        label values major Dept
        label def Dept 1 "A", modify
        label def Dept 2 "B", modify
        label def Dept 3 "C", modify
        label def Dept 4 "D", modify
        label def Dept 5 "E", modify
        label def Dept 6 "F", modify
        label values female female
        label def female 0 "male", modify
        label def female 1 "female", modify
        label values admitted admitted
        label def admitted 0 "rejected", modify
        label def admitted 1 "admitted", modify
        
        tab female admitted
        Some preliminary calculations lead to a graph similar to that presented by authors MBA.

        Code:
        bysort major female (admitted) : egen total = total(frequency)
        
        by major female : gen success = 100 * frequency[2] / total
        
        label var success "% admitted"
        label var total "# applications"
        
        separate total, by(female) veryshortlabel
        
        twoway dropline total? success if admitted || scatter total? success if admitted, ///
        ms(none ..) mlabel(major major) mlabpos(12 ..) mlabsize(medium ..) mlabc(stc1 stc2) ///
        note(Berkeley admissions data) ///
        legend(order(1 2) row(1) pos(12)) xla(0(25)100) ytitle(# applications)
        Click image for larger version

Name:	MBA2.png
Views:	1
Size:	44.8 KB
ID:	1768157



        The graph has to be read major by major. For majors A, B, D, F females are admitted at higher rates than males, while for majors C. E it's the other way round. The appearance of bias against females is a side-effect of differences in frequencies, starting with the fact that females apply in much smaller numbers than males to the majors with highest admission rates. The contradiction between aggregated and disaggregated patterns is an amalgamation paradox often named Simpson's paradox. E.H. SImpson certainly wrote about it, as did G,U. Yule and Karl Pearson before him. What the paradox is called is an example of Stigler's Law to the effect that things get named after people who weren't the first inventors or discoverers (and sure enough, Stigler knew that he wasn't the first to point that out, so the name is self-exemplifying.)

        The pattern and its explanation have been well known and were the main story in the original paper from 1975. The question remaining is how best to show it and this graph is a competitor.

        As a variation on the standard, I plotted results separately by major.


        Code:
        twoway dropline total? success if admitted, ///
        by(major, note(Berkeley admissions data) legend(pos(12))) ///
        legend(row(1)) ytitle(# applications) xla(0(25)100)
        Click image for larger version

Name:	MBA3.png
Views:	1
Size:	51.8 KB
ID:	1768158


        As a final twist, I followed the logic, or rather the convention, that outcome should be plotted on the vertical axis.

        Code:
        twoway dropline success total if !female || dropline success total if female, ///
        by(major, note(Berkeley admissions data) legend(pos(12))) ///
        legend(order(1 "male" 2 "female") row(1)) yla(0(25)100)
        Click image for larger version

Name:	MBA4.png
Views:	1
Size:	54.0 KB
ID:	1768159


        I don't think it is so clear, which may well be why MBA plot it the way they do.

        These data have been plotted in various ways, such as using mosaic plots. Another graphical analysis was included in my talk at the September Stata meeting in London. See slides 53 and 54 of my presentation under https://www.stata.com/meeting/uk24/

        A small puzzle remains what to call this graph, if indeed it really needs a name. MBA plot runs the risk of exemplifying Stigler's Law and is a name all too likely to be misunderstood for other reasons.

        Next posting tomorrow, I hope.
        Last edited by Nick Cox; 23 Nov 2024, 17:04.

        Comment


        • #5
          Proper references for #4

          Bickel, P. J., E. A. Hammel, and J. W. O’Connell. 1975. Sex Bias in Graduate Admissions: Data from Berkeley. Science 187: 398–404. http://www.jstor.org/stable/1739581.

          Agresti, A. 2013. Categorical Data Analysis. Hoboken, NJ: John Wiley. p.63

          Here is another example with data from

          Appleton, D. R., J. M. French, and M. P. J. Vanderpump. 1996. Ignoring a Covariate: An Example of Simpson’s Paradox. The American Statistician 50: 340–41.
          https://doi.org/10.2307/2684931. https://www.jstor.org/stable/2684931

          The story is of women of various ages from Whickham, near Newcastle in the north-east of England, classified by present (tobacco) smoker or never smoked, according to whether they had died 20 years after a survey. (Women who had smoked but given up are excluded.)

          An overall analysis suggests that smoking leads to higher rates of survival, but as before you need to look at each age group,

          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input float(age smoker) byte dead float freq
          1 1 1   2
          1 0 1   1
          1 1 0  53
          1 0 0  61
          2 1 1   3
          2 0 1   5
          2 1 0 121
          2 0 0 152
          3 1 1  14
          3 0 1   7
          3 1 0  95
          3 0 0 114
          4 1 1  27
          4 0 1  12
          4 1 0 103
          4 0 0  66
          5 1 1  51
          5 0 1  40
          5 1 0  64
          5 0 0  81
          6 1 1  29
          6 0 1 101
          6 1 0   7
          6 0 0  28
          7 1 1  13
          7 0 1  64
          7 1 0   0
          7 0 0   0
          end
          label values age age
          label def age 1 "18-24", modify
          label def age 2 "25-34", modify
          label def age 3 "35-44", modify
          label def age 4 "45-54", modify
          label def age 5 "55-64", modify
          label def age 6 "65-74", modify
          label def age 7 "75+", modify
          label values smoker smoker
          label def smoker 0 "non-smoker", modify
          label def smoker 1 "smoker", modify
          label values dead dead
          label def dead 0 "alive", modify
          label def dead 1 "dead", modify
          
          bysort age smoker (dead) : egen total = total(freq)
          
          by age smoker : gen failure = 100 * freq[2]/total 
          
          label var failure "% dead"
          label var total "# women"
          
          separate total, by(smoker) veryshortlabel 
          
          set scheme stcolor 
          
          twoway spike total? failure if dead || scatter total? failure if dead, ///
          ms(none ..) mlabel(age age) mlabpos(12 ..) mlabsize(medium ..) mlabc(stc1 stc2) ///
          note(Whickham data) /// 
          legend(order(1 2) row(1) pos(12)) xla(0(25)100) ytitle(# women) name(App0, replace)
          
          twoway spike total? failure if dead, ///
          by(age, row(1) compact note(Whickham data) legend(pos(12))) ///
          legend(row(1)) ytitle(# women) xla(0(25)100) name(App1, replace)
          
          twoway dropline failure total if !smoker || dropline failure total if smoker, ///
          by(age, row(1) compact note(Whickham data) legend(pos(12))) ///
          legend(order(1 "non-smoker" 2 "smoker") row(1)) yla(0(25)100) name(App2, replace)
          I mix spike and dropline styles a little mischievously to make any interested reader (and myself) wonder what works best.


          Click image for larger version

Name:	App0.png
Views:	1
Size:	50.9 KB
ID:	1768186


          Click image for larger version

Name:	App1.png
Views:	1
Size:	62.9 KB
ID:	1768187






          Click image for larger version

Name:	App2.png
Views:	1
Size:	65.7 KB
ID:	1768188



          Each graph is a little challenging but each has positive value. The last makes the simple point that most of the women 18-24 were still alive 20 years later while none of the women 75+ were. Broadly survival declines with age. So far, so unsurprising, but in each graph the detail is in comparing smokers and non-smokers, which does come out as one would expect.

          Transforming the % dead scale is another option.

          Naturally no serious medical statistics or epidemiology project would focus on these variables alone. My interest is in which graphs best show the patterns -- and indeed if anyone has seen this design before.

          Comment

          Working...
          X