  • Assigning colors to box plot with stratification

    Dear Forum,

    I would like to draw box plots of sCD163 by HIV status (positive vs. negative), (i) stratified by tertiles of cardiac function (tert_mv2) and (ii) overall, i.e., just HIV positive vs. HIV negative (total). Second, and this is where I need some advice, I would like to color-code HIV status (positive = red; negative = blue).

    The box plot copied below, without colors, was produced with the following command:

    - graph box cd163, over(hiv) over(tert_mv2, total)

    To add color, I tried the following command, which unfortunately changes the plot, as also copied below.

    - graph box cd163, over(hiv) over(tert_mv2, total) nofill asyvars bar(1, color(blue)) bar(2, color(red))


    I will very much appreciate any suggestions you may have on how to go about this.

    Thanks in advance.

    Itai



    [Image: Graph Box.png]

    [Image: Plot 2.png]

  • #2
    You can create a total category for the plot and label it so.

    Code:
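    preserve
    * duplicate every observation; new == 1 marks the copies
    expand 2, g(new)
    * move the copies into an extra category (99) that serves as the total
    replace tert_mv2 = 99 if new
    graph box cd163, over(hiv) over(tert_mv2) nofill asyvars showyvars leg(off) nolab  bar(1, color(blue)) bar(2, color(red))
    restore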
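
A minimal sketch of the labelling step, which the code above leaves implicit. It runs on simulated data so that it is self-contained; the data values, the 0/1 coding of hiv, the value label name tertlbl and the category labels are all assumptions for illustration, not taken from the original dataset. It uses box(), the documented graph box option for the look of each box.

```stata
* Simulated stand-in for the thread's variables (all values are made up)
clear
set seed 12345
set obs 200
generate byte hiv = mod(_n, 2)                      // assumed coding: 0 = negative, 1 = positive
generate byte tert_mv2 = 1 + floor(3 * runiform())  // tertile 1, 2 or 3
generate double cd163 = exp(rnormal(6, 0.5))        // right-skewed, like a concentration

preserve
expand 2, generate(new)                // new == 1 flags the duplicated rows
replace tert_mv2 = 99 if new           // the copies form the "Total" group
* Hypothetical value label so that 99 is shown as "Total" on the axis
label define tertlbl 1 "Lowest" 2 "Middle" 3 "Highest" 99 "Total"
label values tert_mv2 tertlbl
graph box cd163, over(hiv) over(tert_mv2) asyvars ///
    box(1, color(blue)) box(2, color(red))
restore
```

Run this from a do-file, since the /// line continuation is not understood in the Command window. With real data that already carry a value label on tert_mv2, adding the 99 entry to that existing label would be the equivalent step.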



    • #3
      Andrew's approach is also written up at https://www.stata-journal.com/articl...article=gr0058

      Otherwise my suggestions are chiefly to do something different.

      1. Excluding outside values is like a red rag to a bull to many reviewers. https://www.merriam-webster.com/dict...0to%20a%20bull.

      2. The generic skewness -- typical of measured concentration variables -- suggests trying a logarithmic scale.

      3. In some cases there is a hint of bimodality which is hidden by the conventional box plot design and often not noticed by readers. On the leftmost box plot, no one is much surprised by the concentration between the minimum and the lower quartile -- that connotes right skewness -- but there is also concentration between the upper quartile and the maximum. What is going on there? Segue into...

      4. More generally, the box plots give less detail than you have space to give. Elsewhere you may be giving more information, but the box plots don't give any indication of sample size or of detailed structure. I'd suggest something more like a dot plot or strip plot.

      5. I don't think there is statistical or even clinical virtue in tertile bins. If you have a measure of cardiac function, why not use it directly? That will be a messy scatter plot, or two scatter plots as you have HIV status too, but that's most of the point, and doing that could be a complement to whatever scatter plot smoothing or modelling you also do.

      It may or may not be relevant here, but I often see box plots used when the main analysis is in terms of means somehow -- some flavour of regression generally construed. Box plots can give you a good qualitative idea about a distribution, but graphing medians and quartiles with one hand and analysing in terms of means and variance or SD with the other hand should more often be thought odd.
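
Suggestions 2 and 4 above can be sketched as follows. This is only an illustration on simulated data: the values, the 0/1 coding of hiv and the helper variable grp are assumptions, and the option choices are one possibility among many.

```stata
* Simulated stand-in for the thread's variables (all values are made up)
clear
set seed 2021
set obs 300
generate byte hiv = mod(_n, 2)
generate byte tert_mv2 = 1 + floor(3 * runiform())
generate double cd163 = exp(rnormal(6, 0.5))

* Suggestion 2: the same box plot on a logarithmic scale
graph box cd163, over(hiv) over(tert_mv2) asyvars yscale(log) ///
    name(box_log, replace)

* Suggestion 4: one marker per observation, so that sample size and any
* bimodality are visible. grp (hypothetical name) holds one category per
* combination of tertile and HIV status.
egen byte grp = group(tert_mv2 hiv), label
dotplot cd163, over(grp) center median name(dots, replace)
```

The name() options keep both graphs in memory so that they can be compared side by side.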



      • #4
        Andrew - Thank you. It worked like a charm.

        Nick - Thank you for the reference as well as the additional suggestions (# 1 - 4). The dot plot is actually more informative and visually addresses concern #3. Comment #5 - I do see your point. In fact, the analysis includes the measure of cardiac function as both a continuous variable (scatter plot + fitted line) and a categorical variable (tertiles).

        I have an additional question (and I will presume to post it on this thread rather than start a new one, as I think it's related). I will appreciate further guidance/advice.

        I would like to amend command #1 (below) with the goal of reducing the y-axis to the range (-15 to -22).

        #1 - graph bar gcs_2d if gcs_2d!=. & tert_mv2!=., over(hiv) over(tert_mv2) nofill asyvars showyvars leg(off) nolab bar(1, color(blue)) bar(2, color(red)) xalternate ytitle("Mean Peak GCS (%)")

        I have attempted (#2 below) adding yscale(range(-15 -22)) to the command with no success. It returns the same plot as #1.

        #2 - graph bar gcs_2d if gcs_2d!=. & tert_mv2!=., over(hiv) over(tert_mv2) nofill asyvars showyvars leg(off) nolab bar(1, color(blue)) bar(2, color(red)) xalternate ytitle("Mean Peak GCS (%)") yscale(range(-15 -22))

        A last resort (command #3, below) was to exclude zero (exclude0) which, exasperatingly, gives the below plot, completely dropping one column (Lowest tertile/ Negative).

        #3 - graph bar gcs_2d if gcs_2d!=. & tert_mv2!=., over(hiv) over(tert_mv2) nofill asyvars showyvars leg(off) nolab bar(1, color(blue)) bar(2, color(red)) xalternate ytitle("Mean Peak GCS (%)") exclude0


        Thanks in advance.

        Itai

        [Image: GSC inc vs exc0.png]


        • #5
          The documentation for axis scale options is adamant that specifying a range is not a way to exclude data. Specifying a base other than zero is also hard to defend without a substantive reason. (32 Fahrenheit being freezing would to me count as a substantive reason.)

          But why use bar plots at all? The point is presumably not that the values are very similar compared with zero, which they are, but to look at the differences between them, and graph dot is much more direct. Note its exclude0 option. Or you might as well just use a scatter plot.

          This is a self-contained example with fake data in a similar range.

          Code:
          . clear
          
          . set obs 6
          number of observations (_N) was 0, now 6
          
          . gen which = _n
          
          . range whatever -18.8 -20.8
          
          . graph dot (asis) whatever, over(which) exclude0 vertical scheme(s1color) l1title(whatever)  linetype(line) lines(lc(gs12) lw(vthin)) yla(, ang(h))
          
          . scatter whatever which , scheme(s1color) xla(, grid glc(gs12) glw(vthin))
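
The same contrast can be sketched with variables named as in #4. The data here are simulated, and the values and the 0/1 coding of hiv are assumptions; only the variable names and the axis title come from the thread.

```stata
* Simulated stand-in for the thread's variables (all values are made up)
clear
set seed 42
set obs 120
generate byte hiv = mod(_n, 2)
generate byte tert_mv2 = 1 + mod(ceil(_n / 2), 3)
generate double gcs_2d = rnormal(-19, 2)

* graph bar includes 0 on the axis by default, and range() can only
* widen an axis, so asking for -22 to -15 leaves the plot unchanged
graph bar (mean) gcs_2d, over(hiv) over(tert_mv2) asyvars ///
    yscale(range(-22 -15)) name(bars, replace)

* With point symbols there is no bar length to misread, so exclude0
* can let the axis cover just the range of the means
graph dot (mean) gcs_2d, over(hiv) over(tert_mv2) asyvars exclude0 ///
    marker(1, mcolor(blue)) marker(2, mcolor(red)) ///
    ytitle("Mean Peak GCS (%)") name(dots, replace)
```

Run this from a do-file, since the /// line continuation is not understood in the Command window.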



          • #6
            Thank you (belatedly) Nick for this explanation and example.

            "Specifying a base other than zero is also hard to defend without a substantive reason. (32 Fahrenheit being freezing would to me count as a substantive reason.)"

            Regarding exclusion of a zero base on the x-axis (your comment above), medical journals seem to do so often, I assume to emphasize differences in outcomes that would not be as visually marked if zero were included on the axis. There is no other apparent substantive reason.

            Below is a figure from the medical journal "Circulation" just as an example. Assuming I understood your comment correctly to begin with, my question then is: Is exclusion of a zero base a case of "clinicians" getting away with an erroneous practice that has gained currency from use, or is it really a matter of preference (vs. principle) whether or not the x-axis is zero-based?

            Thanks.
            Itai

            [Image: stataforum.jpg]


            • #7
              Unless there is a clinical meaning to -8% as a base I say that example is hard to defend. The point of the graphic is comparison of values with each other, not with -8%, I presume.

              A dot chart would be a better idea.

              https://heart.bmj.com/content/102/5/349.short is a good paper that gets to the heart of the matter on graphics in cardiology.
