  • How to do a Box Plot with mean instead of median and SD instead of quartiles?

    Dear Statalisters,

    Please have a look at my data:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float event byte countrycode
    0 2
    0 4
    0 7
    0 7
    0 5
    1 3
    0 9
    1 4
    0 8
    0 5
    0 8
    0 5
    0 5
    0 5
    0 5
    0 3
    0 5
    0 3
    0 5
    0 7
    end
    event is a binary variable. I'd like to generate some visual descriptive statistics for event. Given that it is a binary variable, a box plot is inappropriate, as there would be no meaningful box boundaries. But I would like something that looks like a box plot, showing the mean where a box plot traditionally shows the median, with the borders at the mean +/- the standard deviation. To make things clearer, here's an image of how it would look:

    [Image: image_27776.png]

    For each country code, the center of the box would be the mean of event, the left boundary of the box would be the mean - SD of event, and the right boundary of the box would be the mean + SD of event, so the mean would be at the center of each box.

    Is there a way to generate such plots? I tried the command -graph box- but my attempts remained unsuccessful. Any help would be appreciated. Thanks a lot!
    Last edited by Julia Simon; 23 Jun 2022, 18:46.

  • #2
    For some reason I can't see this thread displayed in the latest topics. I hope someone can see it.



    • #3
      I'm sure it's possible, but why would you want to do this?



      • #4
        The combination of a range bar and a scatter plot should do it.

        Code:
        help tw rbar
        help tw scatter
        Event does not vary for most countries in your example dataset, so I generate a toy dataset below:

        Code:
        clear
        set obs 200
        set seed 06242022
        gen event= rnormal(0.9,0.005)>0.9085
        gen countrycode = runiformint(1,4)
        *START HERE
        bys countrycode: egen mean= mean(event)
        bys countrycode: egen SD= sd(event)
        gen low= mean-SD
        gen high = mean + SD
        contract countrycode mean SD low high
        set scheme s1mono
        tw (rbar low high countrycode, horiz bfcolor(white) blc(black) barwidth(.5)) ///
        (scatter countrycode mean, msy(pipe) msize(12) mc(black)), yline(1 2 3 4) ///
        ylab(1 "Country 1" 2 "Country 2" 3 "Country 3" 4 "Country 4", angle(horiz)) ///
        leg(off) ytitle("") xlab(-0.3 " " 0 (.5) 1, noticks)
        [Image: Graph.png]


        • #5
          Jared: I'd just like a visual way to represent both the mean and the standard deviation of my variable event by country. Perhaps there are simpler ways to do that, but if so, I do not know them.

          Andrew: Thank you, it's what I wanted! By any chance, is there a way to fill the boxes with different colors? I have a lot of countries and there is much less space than your example, so it can get confusing.
          Last edited by Julia Simon; 24 Jun 2022, 01:47.
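
          One possible way to get a different fill color per box, sketched here as an assumption rather than as a reply from the thread: draw one rbar layer per country, each with its own bfcolor(). The toy data follow #4; the color list and the use of collapse are illustrative choices, and with many countries the list would need extending.

```stata
* Sketch: one rbar layer per country so each box gets its own fill color
clear
set obs 200
set seed 06242022
gen event = rnormal(0.9, 0.005) > 0.9085
gen countrycode = runiformint(1, 4)

* one row per country holding the mean and SD of event
collapse (mean) mean = event (sd) SD = event, by(countrycode)
gen low = mean - SD
gen high = mean + SD

* illustrative colors, one per country (an assumption, not from the thread)
local colors "ltblue sand eltgreen erose"
local layers
forvalues j = 1/4 {
    local c : word `j' of `colors'
    local layers `layers' (rbar low high countrycode if countrycode == `j', horizontal bfcolor(`c') blcolor(black) barwidth(.5))
}

twoway `layers' (scatter countrycode mean, msymbol(pipe) msize(12) mcolor(black)), ///
    ylabel(1(1)4, angle(horizontal)) ytitle("") legend(off)
```

          The loop only builds the list of plot layers; the final twoway call overlays them with the mean markers from #4.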



          • #6
            I really would not recommend this for any purpose unless you have some boss insisting on it, and I would still say she's misguided.

            Some fraction of your readers will be confused or worse by your borrowing box plot form for a summary with no relationship at all to conventions that boxes represent medians and quartiles.

            Worse, your convention leads to awkwardness if not absurdity -- or at least waste of space -- whenever mean MINUS SD is negative, as I think Andrew Musau is gently hinting in #4.

            More positively, all the information here is carried by the mean (and the total number of values). The SD is a direct function of the mean: for indicator variables the variance is mean * (1 - mean), so the SD is its square root, modulo small print about the divisor used in practice. The limiting cases are instructive, as a mean of 0 or 1 can only happen if all values are 0 or 1, respectively, and then the SD is 0 in either case.
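
            A quick check of that relationship, as a minimal sketch on the same toy data: for a 0/1 variable the SD reported by summarize should equal the square root of mean * (1 - mean) * n/(n - 1), where the last factor reflects the n - 1 divisor. The scalar names are illustrative only.

```stata
* Sketch: check that the SD of a 0/1 variable is determined by its mean
clear
set obs 200
set seed 06242022
gen event = rnormal(0.9, 0.005) > 0.9085

summarize event
scalar pbar = r(mean)
scalar nobs = r(N)
scalar sd_reported = r(sd)

* summarize uses divisor n - 1, hence the factor n/(n - 1)
scalar sd_formula = sqrt(pbar * (1 - pbar) * nobs / (nobs - 1))

display "SD from summarize: " sd_reported
display "SD from the mean:  " sd_formula
```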

            There are at least two better alternatives in my view. One is to show means and the number of values in a hybrid of graph and table. The other is to show confidence intervals for the mean. There is no excuse to use any method that doesn't respect the bounds of 0 and 1. See

            help ci

            and the linked manual entry. I like the Jeffreys interval but the Wilson interval works fine too.
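
            To see the two intervals side by side without any data in memory, here is a small sketch using the immediate form cii. It assumes a Stata version recent enough to support the cii proportions syntax, and it uses the counts from the example data in #1 (2 events in 20 observations). The scalar names are illustrative.

```stata
* Sketch: Wilson and Jeffreys intervals for 2 events in 20 observations
cii proportions 20 2, wilson
scalar wilson_lb = r(lb)
scalar wilson_ub = r(ub)

cii proportions 20 2, jeffreys
scalar jeffreys_lb = r(lb)
scalar jeffreys_ub = r(ub)

* both intervals stay within the bounds of 0 and 1
display "Wilson:   " wilson_lb " to " wilson_ub
display "Jeffreys: " jeffreys_lb " to " jeffreys_ub
```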

            Here I stole Andrew's sandbox and show two displays. Naturally some details in the graph code might need tweaking for your real dataset.

            Code:
            clear
            set obs 200
            set seed 06242022
            gen event= rnormal(0.9,0.005)>0.9085
            gen countrycode = runiformint(1,4)
            bys countrycode: egen mean= mean(event)
            by countrycode: gen tag = _n == 1
            
            set scheme s1color 
            
            su countrycode, meanonly 
            forval j = 1/`r(max)' { 
                count if countrycode == `j' & event < . 
                label def countrycode `j' "`j' ({it:n} = `r(N)')", add 
            } 
            
            label val countrycode countrycode 
            
            scatter countrycode mean if tag, yla(, grid glc(gs12) valuelabel ang(h)) xsc(alt) xtitle(text about mean) legend(off) xla(#5) xline(0, lc(gs12)) xsc(r(0 .))
            
            preserve 
            
            statsby lb=r(lb) ub=r(ub) mean=r(mean), by(countrycode) clear: ci proportions event, jeffreys 
            
            scatter countrycode mean, yla(, grid glc(gs12) valuelabel ang(h)) xsc(alt) xtitle(text about mean) legend(off) xla(#5) xline(0, lc(gs12)) xsc(r(0 .)) || rcap lb ub countrycode, horizontal
            [Image: event_G1.png]

            [Image: event_G2.png]


            • #7
              A canonical reference here is

              Brown, L. D., T. T. Cai, and A. DasGupta. 2001. Interval estimation for a binomial proportion. Statistical Science 16(2): 101-133. https://doi.org/10.1214/ss/1009213286


              which led to StataCorp rewriting ci about 20 years ago.

              It's eye-opening about how traditional intervals can be awful and in explaining that there are much better alternatives.

              See also cij from SSC for the Jeffreys interval. Although that command is not needed now, as the idea has been folded into ci long since, the help file says more than does Stata's own documentation.

              Ditto ciw from SSC for the Wilson interval.

              So,

              Code:
              ssc type cij.hlp 
              
              ssc type ciw.hlp
