Assigning colors to box plot with stratification

Itai Magodoro

Join Date: Oct 2020

Posts: 26
#1

16 Apr 2021, 07:23

Dear Forum,

I would like to box plot sCD163 over HIV status (positive vs. negative), (i) stratified by tertiles of cardiac function (tert_mv2) and (ii) overall, i.e., just HIV positive vs. HIV negative (total). Second, and this is where I need advice, I would like to color-code HIV status (positive = red; negative = blue).

The box plot copied below, without colors, was produced with the following command:

- graph box cd163, over(hiv) over(tert_mv2, total)

To add color, I tried the following command which, unfortunately, changes the layout of the plot, as also copied below.

- graph box cd163, over(hiv) over(tert_mv2, total) nofill asyvars bar(1, color(blue)) bar(2, color(red))

I will very much appreciate any suggestions you may have on how to go about this.

Thanks in advance.

Itai

Andrew Musau

Join Date: Oct 2014
Posts: 10088

16 Apr 2021, 08:33

You can create a total category for the plot and label it so.

Code:

preserve
expand 2, g(new)
replace tert_mv2= 99 if new
graph box cd163, over(hiv) over(tert_mv2) nofill asyvars showyvars leg(off) nolab  bar(1, color(blue)) bar(2, color(red))
restore
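
The code above assigns the value 99 to the duplicated observations but does not label it. A minimal sketch of one way to show "Total" on the axis follows; it is an illustration, not part of the original answer. The label name tert_lbl is hypothetical and is only used if tert_mv2 has no value label attached. The sketch also uses box(), the option documented for graph box, in place of bar().

```stata
preserve
expand 2, generate(new)
replace tert_mv2 = 99 if new

* Find the value label attached to tert_mv2, if there is one
local lbl : value label tert_mv2
if "`lbl'" == "" {
    * Assumption: no label attached, so create one (tert_lbl is a hypothetical name)
    label define tert_lbl 99 "Total"
    label values tert_mv2 tert_lbl
}
else {
    * Add the new category to the existing label
    label define `lbl' 99 "Total", add
}

graph box cd163, over(hiv) over(tert_mv2) nofill asyvars showyvars legend(off) ///
    box(1, color(blue)) box(2, color(red))
restore
```

Because everything happens between preserve and restore, neither the extra observations nor the added label remain in the data afterwards.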


Nick Cox

Join Date: Mar 2014

Posts: 35438
#3

16 Apr 2021, 08:59

Andrew's approach is also written up at https://www.stata-journal.com/articl...article=gr0058

Otherwise my suggestions are chiefly to do something different.

1. Excluding outside values is like a red rag to a bull to many reviewers. https://www.merriam-webster.com/dict...0to%20a%20bull.

2. The generic skewness -- typical of measured concentration variables -- suggests trying a logarithmic scale.

3. In some cases there is a hint of bimodality which is hidden by the conventional box plot design and often not noticed by readers. On the leftmost box plot, no one is much surprised by the concentration between the minimum and the lower quartile -- that connotes right skewness -- but there is also concentration between the upper quartile and the maximum. What is going on there? Segue into...

4. More generally, the box plots give less detail than you have space to give. Elsewhere you may be giving more information, but the box plots don't give any indication of sample size or of detailed structure. I'd suggest something more like a dot plot or strip plot.

5. I don't think there is statistical or even clinical virtue in tertile bins. If you have a measure of cardiac function, why not use it directly? That will be a messy scatter plot, or two scatter plots as you have HIV status too, but that's most of the point, and doing that could be a complement to whatever scatter plot smoothing or modelling you also do.
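
Suggestions 2 and 4 might be sketched as follows, assuming the variable names used earlier in the thread (cd163, hiv, tert_mv2) and strictly positive sCD163 values. The new names log_cd163 and grp are hypothetical. This is one possible reading of the suggestions, not a definitive recipe.

```stata
* Suggestion 2: work on a logarithmic scale. Transforming first means that
* whiskers and outside values are computed on the log scale.
generate log_cd163 = log10(cd163)
label variable log_cd163 "log10 sCD163"
graph box log_cd163, over(hiv) over(tert_mv2) nofill asyvars ///
    box(1, color(blue)) box(2, color(red))

* Suggestion 4: show every observation with the official dotplot command.
* egen group() builds one category per tertile-by-HIV combination.
egen grp = group(tert_mv2 hiv), label
dotplot cd163, over(grp) center median xlabel(, angle(45))
```

A dot plot of this kind shows sample sizes and any bimodality directly, which the box summary hides.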

It may or may not be relevant here, but I often see box plots used when the main analysis is in terms of means somehow -- some flavour of regression generally construed. Box plots can give you a good qualitative idea about a distribution, but graphing medians and quartiles with one hand and analysing in terms of means and variance or SD with the other hand should more often be thought odd.
Itai Magodoro

Join Date: Oct 2020

Posts: 26
#4

17 Apr 2021, 18:41

Andrew - Thank you. It worked like a charm.

Nick - Thank you for the reference as well as the additional suggestions (#1 - 4). The dot plot is actually more informative and visually addresses concern #3. Comment #5 - I do see your point. In fact, the analysis includes the measure of cardiac function as both a continuous (scatter plot + fitting a line) and a categorical (as tertiles) variable.

I have an additional question (and I will presume to post it on this thread rather than starting a new one, as I think it's related). I will appreciate further guidance/advice.

I would like to amend command #1 (below) with the goal of reducing the y-axis to the range (-15 to -22).

#1 - graph bar gcs_2d if gcs_2d!=. & tert_mv2!=., over(hiv) over(tert_mv2) nofill asyvars showyvars leg(off) nolab bar(1, color(blue)) bar(2, color(red)) xalternate ytitle("Mean Peak GCS (%)")

I have attempted (#2 below) adding yscale(range(-15 -22)) to the command with no success. It returns the same plot as #1.

#2 - graph bar gcs_2d if gcs_2d!=. & tert_mv2!=., over(hiv) over(tert_mv2) nofill asyvars showyvars leg(off) nolab bar(1, color(blue)) bar(2, color(red)) xalternate ytitle("Mean Peak GCS (%)") yscale(range(-15 -22))

A last resort (command #3, below) was to exclude zero (exclude0) which, exasperatingly, gives the below plot, completely dropping one column (Lowest tertile/ Negative).

#3 - graph bar gcs_2d if gcs_2d!=. & tert_mv2!=., over(hiv) over(tert_mv2) nofill asyvars showyvars leg(off) nolab bar(1, color(blue)) bar(2, color(red)) xalternate ytitle("Mean Peak GCS (%)") exclude0

Thanks in advance.

Itai
Nick Cox

Join Date: Mar 2014

Posts: 35438
#5

18 Apr 2021, 03:59

The documentation for axis scale options is adamant that specifying a range is not a way to exclude data. Specifying a base other than zero is also hard to defend without a substantive reason. (32 Fahrenheit being freezing would to me count as a substantive reason.)
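
A small illustration of that point, using the auto dataset shipped with Stata rather than the data from this thread: a range narrower than the plotted bars is ignored, while a wider range extends the axis.

```stata
sysuse auto, clear

* Bars start at zero, so a narrower range does not crop them:
* the axis still covers zero up to the tallest bar.
graph bar (mean) mpg, over(foreign) yscale(range(15 25)) name(narrow, replace)

* A wider range is honoured: the axis is extended to 40.
graph bar (mean) mpg, over(foreign) yscale(range(0 40)) name(wide, replace)
```

This matches the behaviour reported in #4, where adding yscale(range(-15 -22)) returned the same plot.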

But why use bar plots at all? The point is presumably not that the values are very similar, compared with zero, which they are, but to look at the differences between them, and graph dot is much more direct. Note its exclude0 option. Or you might as well just use a scatter plot.

This is a self-contained example with fake data in a similar range.

Code:

clear
set obs 6
gen which = _n
range whatever -18.8 -20.8
graph dot (asis) whatever, over(which) exclude0 vertical scheme(s1color) l1title(whatever) linetype(line) lines(lc(gs12) lw(vthin)) yla(, ang(h))
scatter whatever which , scheme(s1color) xla(, grid glc(gs12) glw(vthin))
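
Applied to the variables from #4, the same idea might look like the sketch below. It assumes gcs_2d, hiv and tert_mv2 as named there, and that the first hiv category is the one to be shown in blue. It uses the default horizontal layout of graph dot, so the numeric axis carries the title.

```stata
* Means of gcs_2d by HIV status within tertile, without forcing zero onto the axis
graph dot (mean) gcs_2d if !missing(gcs_2d, tert_mv2), ///
    over(hiv) over(tert_mv2) asyvars exclude0 ///
    marker(1, mcolor(blue)) marker(2, mcolor(red)) ///
    ytitle("Mean Peak GCS (%)")
```

With asyvars, the two HIV groups share a line within each tertile, so the differences between them are read off directly and not against a zero baseline.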
Itai Magodoro

Join Date: Oct 2020

Posts: 26
#6

03 May 2021, 15:13

Thank you (belatedly) Nick for this explanation and example.

"Specifying a base other than zero is also hard to defend without a substantive reason. (32 Fahrenheit being freezing would to me count as a substantive reason.)"

Regarding exclusion of a zero base on the x-axis (your comment above), medical journals often seem to do so, I assume to emphasize visually differences in outcomes that would not be as marked if zero were included on the axis. There is no other apparent substantive reason.

Below is a figure from the medical journal "Circulation" as an example. Assuming I understood your comment correctly, my question is: Is excluding a zero base a case of "clinicians" getting away with an erroneous practice that has gained currency through use, or is it really a matter of preference (vs. principle) whether or not the x-axis is zero-based?

Thanks.
Itai

Attached Files
Nick Cox

Join Date: Mar 2014

Posts: 35438
#7

03 May 2021, 17:24

Unless there is a clinical meaning to -8% as a base I say that example is hard to defend. The point of the graphic is comparison of values with each other, not with -8%, I presume.

A dot chart would be a better idea.

https://heart.bmj.com/content/102/5/349.short is a good paper that gets to the heart of the matter on graphics in cardiology.