How to do a Box Plot with mean instead of median and SD instead of quartiles?

Julia Simon

Join Date: Apr 2022

Posts: 37
#1


23 Jun 2022, 18:43

Dear Statalisters,

Please have a look at my data:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float event byte countrycode
0 2
0 4
0 7
0 7
0 5
1 3
0 9
1 4
0 8
0 5
0 8
0 5
0 5
0 5
0 5
0 3
0 5
0 3
0 5
0 7
end

event is a binary variable. I'd like to generate some visual descriptive statistics for event. Given that it is a binary variable, a box plot is inappropriate, as the box would have no meaningful boundaries. But I would like something that looks like a box plot, showing the mean where it traditionally shows the median, with the borders at the mean +/- the standard deviation. To make things clearer, here's an image of how it would look:

For each country code, the center of the box would be the mean of event, the left boundary of the box would be the mean - SD of event, and the right boundary of the box would be the mean + SD of event, so the mean would be at the center of each box.

Is there a way to generate such plots? I tried the command -graph box- but my attempts remained unsuccessful. Any help would be appreciated. Thanks a lot!

Last edited by Julia Simon; 23 Jun 2022, 18:46.
Julia Simon

Join Date: Apr 2022

Posts: 37
#2

23 Jun 2022, 18:55

For some reason I can't see this thread displayed in the latest topics. I hope someone can see it.
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#3

23 Jun 2022, 19:52

I'm sure it's possible, but why would you want to do this?

Andrew Musau

Join Date: Oct 2014
Posts: 10190
#4

23 Jun 2022, 19:59

The combination of a range bar and a scatter plot should do it.

Code:

help tw rbar
help tw scatter

Event does not vary for most countries in your example dataset, so I generate a toy dataset below:

Code:

clear
set obs 200
set seed 06242022
gen event= rnormal(0.9,0.005)>0.9085
gen countrycode = runiformint(1,4)
*START HERE
bys countrycode: egen mean= mean(event)
bys countrycode: egen SD= sd(event)
gen low= mean-SD
gen high = mean + SD
contract countrycode mean SD low high
set scheme s1mono
tw (rbar low high countrycode, horiz bfcolor(white) blc(black) barwidth(.5)) ///
(scatter countrycode mean, msy(pipe) msize(12) mc(black)), yline(1 2 3 4) ///
ylab(1 "Country 1" 2 "Country 2" 3 "Country 3" 4 "Country 4", angle(horiz)) ///
leg(off) ytitle("") xlab(-0.3 " " 0 (.5) 1, noticks)

[Image: Graph.png]

Julia Simon

Join Date: Apr 2022

Posts: 37
#5

24 Jun 2022, 01:44

Jared: I'd just like a visual way to represent both the mean and the standard deviation of my variable event by country. Perhaps there are simpler ways to do that, but if so, I do not know them.

Andrew: Thank you, it's what I wanted! By any chance, is there a way to fill the boxes with different colors? I have a lot of countries and there is much less space than your example, so it can get confusing.

Last edited by Julia Simon; 24 Jun 2022, 01:47.
Nick Cox

Join Date: Mar 2014

Posts: 35696
#6

24 Jun 2022, 03:57

I really would not recommend this for any purpose unless you have some boss insisting on it, and I would still say she's misguided.

Some fraction of your readers will be confused or worse by your borrowing box plot form for a summary with no relationship at all to conventions that boxes represent medians and quartiles.

Worse, your convention leads to awkwardness if not absurdity -- or at least waste of space -- whenever mean MINUS SD is negative, as I think Andrew Musau is gently hinting in #4.

More positively, all the information here is carried by the mean (and the total number of values). The SD is a direct function of the mean, as for indicator variables the SD is sqrt(mean * (1 - mean)), modulo small print about the divisor used in practice. The limiting cases are instructive, as a mean of 0 or 1 can only happen if all values are 0 or 1, respectively, and then the SD is 0 in either case.
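A quick way to convince yourself of this is sketched below. For a 0/1 variable with mean p and n values, the variance with divisor n is p * (1 - p), and the SD reported by summarize (divisor n - 1) is sqrt(p * (1 - p) * n / (n - 1)). The simulated variable and the probability 0.3 are made up purely for illustration.

```stata
* Sketch: the SD of an indicator variable is determined by its mean
* (simulated data; the 0.3 is an arbitrary choice for illustration)
clear
set obs 200
set seed 06242022
gen byte event = runiform() < 0.3

quietly summarize event
local p  = r(mean)
local n  = r(N)
local sd = r(sd)

* summarize uses divisor n - 1 for the variance
display "SD from summarize:          " `sd'
display "sqrt(p*(1-p)*n/(n-1)):      " sqrt(`p' * (1 - `p') * `n' / (`n' - 1))
* the textbook form uses divisor n, so it differs slightly
display "sqrt(p*(1-p)), divisor n:   " sqrt(`p' * (1 - `p'))
```

The first two displayed values should agree to machine precision; the third differs only by the factor sqrt(n / (n - 1)).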

There are at least two better alternatives in my view. One is to show means and the number of values in a hybrid of graph and table. The other is to show confidence intervals for the mean. There is no excuse to use any method that doesn't respect the bounds of 0 and 1. See

help ci

and the linked manual entry. I like the Jeffreys interval but the Wilson interval works fine too.
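As a rough illustration of how the interval types compare, the immediate form of the command can be run on summary counts. The counts below (2 events in 40 values) are made up, and the syntax assumes Stata 14 or later, where the command is ci proportions / cii proportions.

```stata
* Sketch with made-up counts: 2 events out of 40 observations
* (assumes Stata 14 or later for the -cii proportions- syntax)
cii proportions 40 2, wald
cii proportions 40 2, wilson
cii proportions 40 2, jeffreys
```

With a proportion this close to 0, the differences between the traditional Wald interval and the Wilson or Jeffreys intervals should be easy to see in the output.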

Here I stole Andrew's sandbox and show two displays. Naturally some details in the graph code might need tweaking for your real dataset.

Code:

clear
set obs 200
set seed 06242022
gen event= rnormal(0.9,0.005)>0.9085
gen countrycode = runiformint(1,4)
bys countrycode: egen mean= mean(event)
* tag one observation per country so each mean is plotted once
by countrycode: gen tag = _n == 1

set scheme s1color

* value labels showing the number of non-missing values per country
su countrycode, meanonly
forval j = 1/`r(max)' {
    count if countrycode == `j' & event < .
    label def countrycode `j' "`j' ({it:n} = `r(N)')", add
}
label val countrycode countrycode

* display 1: means only, with counts in the axis labels
scatter countrycode mean if tag, yla(, grid glc(gs12) valuelabel ang(h)) xsc(alt) xtitle(text about mean) legend(off) xla(#5) xline(0, lc(gs12)) xsc(r(0 .))

* display 2: means with Jeffreys confidence intervals
preserve
statsby lb=r(lb) ub=r(ub) mean=r(mean), by(countrycode) clear: ci proportions event, jeffreys

scatter countrycode mean, yla(, grid glc(gs12) valuelabel ang(h)) xsc(alt) xtitle(text about mean) legend(off) xla(#5) xline(0, lc(gs12)) xsc(r(0 .)) || rcap lb ub countrycode, horizontal
Nick Cox

Join Date: Mar 2014

Posts: 35696
#7

24 Jun 2022, 05:33

A canonical reference here is

Brown, L. D., T. T. Cai, and A. DasGupta. 2001. Interval estimation for a binomial proportion. Statistical Science 16(2): 101-133. https://doi.org/10.1214/ss/1009213286

which led to StataCorp rewriting ci about 20 years ago.

It's eye-opening about how traditional intervals can be awful and in explaining that there are much better alternatives.

See also cij from SSC for the Jeffreys interval. Although that command is not needed now, as the idea has been folded into ci long since, the help file says more than does Stata's own documentation.

Ditto ciw from SSC for the Wilson interval.

So,

Code:

ssc type cij.hlp
ssc type ciw.hlp
