  • Drawing a boxplot

    Hi Stata Users,

    I am using Stata v15 to look at the distribution of a continuous variable using a box plot.
    The code is
    Code:
    graph box gvalue, over(year) horizontal
    And the output is
    [Image: boxplot.png]

    I am wondering why the box and whiskers aren’t visible. Is it because of some outliers?
    Is there an alternative to boxplot?
    Thanks in advance!

  • #2
    For distributions like yours, box plots are often puzzling or even misleading and best avoided.

    Look at your data -- in table form.

    Code:
    tabstat gvalue, by(year) c(s)  s(n min p25 p50 p75 max)
    will show that median and quartiles are close, possibly even tying. So the boxes will be short, possibly even of zero length. So 1.5 IQR will be short, possibly even zero. So whiskers will be short, even of zero length.
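
    A minimal sketch of that chain of reasoning, assuming the variable names gvalue and year from #1 (the new variable names gv_p25, gv_p75, gv_upper, gv_nbeyond and gv_tag are made up for illustration). It computes the quartiles by year, the upper limit p75 + 1.5 IQR implied by the whisker rule, and a count of the values beyond that limit, then lists one line per year.

    Code:
    * quartiles of gvalue within each year
    egen gv_p25 = pctile(gvalue), p(25) by(year)
    egen gv_p75 = pctile(gvalue), p(75) by(year)
    * upper limit for whiskers under the 1.5 IQR rule
    gen gv_upper = gv_p75 + 1.5 * (gv_p75 - gv_p25)
    * how many nonmissing values lie beyond that limit in each year
    egen gv_nbeyond = total(gvalue > gv_upper & !missing(gvalue)), by(year)
    * show one line per year
    egen gv_tag = tag(year)
    sort year
    list year gv_p25 gv_p75 gv_upper gv_nbeyond if gv_tag, noobs

    If gv_p25 and gv_p75 are equal or nearly so, gv_upper sits at or just above the upper quartile and most of the larger values will be plotted as separate points.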

    You need to tell us more about the data: perhaps the data are sensitive or confidential, which is why you cropped the axis labels.

    Much hinges on whether the data are, or could be, negative, zero or positive. In practice, there are two very common cases.

    If gvalue is either positive or zero, it is possible that the best way to proceed is to show the fraction of zeros separately and plot the others on logarithmic scale. This is natural whenever the zeros are qualitatively as well as quantitatively different.

    If gvalue is always positive then you surely need logarithmic scale, but see https://www.stata.com/support/faqs/g...ithmic-scales/
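
    As a sketch of the two cases, again assuming the names gvalue and year from #1 (iszero is a hypothetical new variable): the first two commands report the fraction of zeros by year, and the last draws the box plot for the positive values only, on logarithmic scale. The caveats about the 1.5 IQR rule in the FAQ just cited apply to that last command.

    Code:
    * indicator for zeros; stays missing where gvalue is missing
    gen byte iszero = gvalue == 0 if !missing(gvalue)
    * the mean of a 0/1 indicator is the fraction of zeros
    tabstat iszero, by(year) s(n mean)
    * positive values only, on logarithmic scale
    graph box gvalue if gvalue > 0 & !missing(gvalue), over(year) horizontal ysc(log)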







    • #3
      Thanks Nick Cox for your reply.

      The data take only positive values, so I believe transforming to a logarithmic scale would be a feasible alternative.



      • #4
        Pushing further with simulated data, as the data behind #1 aren't available, I used my personal default that the lognormal is likely to be a reference distribution here. This example is reproducible provided that you install stripplot from SSC.
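
        For the installation step, the usual command is the following, which needs an internet connection:

        Code:
        ssc install stripplot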

        Code:
        clear
        set obs 1000
        egen year = seq(), block(200) from(2015) to(2019)
        set seed 2803
        gen whatever = exp(rnormal())
        set scheme s1color
        stripplot whatever, over(year) ysc(log) cumul cumprob vertical box(barw(0.1)) pctile(0) boffset(-0.1) yla(20 10 5 2 1 0.5 "0.5" 0.2 "0.2" 0.1 "0.1", ang(h)) xla(, tlcolor(none)) xtitle("")
        [Image: okiya.png]




        Detailed notes:

        1. As explained at length in the FAQ cited in #2, the rule "show points distinctly if they are more than 1.5 IQR from the nearer quartile" doesn't mesh well with logarithmic scale, but the whole problem can be avoided by using a different criterion for whiskers. In the plot here the whiskers just extend to the extremes. For data that are all positive, evidently minimum of logarithms = logarithm of minimum, and similarly for the maximum. For the median and quartiles, in practice summary of logarithms is close to or identical to logarithm of summary; the small print is that the two can differ slightly whenever a median or a quartile is interpolated on the original scale.

        2. A skeletal box plot is in my view defensible, even preferable, when a quantile plot alongside gives more detail about the distribution.

        3. Connoisseurs of fine detail will note the trick of making a tick invisible by giving it no colour.

        4. Naturally the data for different years are drawn from the same distribution in this case, but Stephen's data are likely to be more interesting.

        5. My bias is that readers who need to be told that 2015 to 2019 represent the year are in the wrong place somehow. Hence I zapped the axis title.

        6. Despite being the author of stripplot I don't come up with detailed option choices without a certain amount of fooling around to find out what works. That is likely to be true for anyone else with different data.
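
        The claim in note 1 about summaries of logarithms can be checked directly. This is only a sketch with simulated data of the same kind as above; the names logwhatever, min_raw and p50_raw are made up for illustration, and double precision is used so that rounding to float doesn't muddy the comparison.

        Code:
        clear
        set obs 1000
        set seed 2803
        gen double whatever = exp(rnormal())
        gen double logwhatever = ln(whatever)
        * summaries on the original scale
        summarize whatever, detail
        scalar min_raw = r(min)
        scalar p50_raw = r(p50)
        * summaries of the logarithms
        summarize logwhatever, detail
        * minimum: the two routes should agree
        display r(min) - ln(min_raw)
        * median: close, but need not agree exactly when it is interpolated
        display r(p50) - ln(p50_raw)

        With an even number of observations the median is the average of the two middle values, which is where any small difference comes from.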
        Last edited by Nick Cox; 14 May 2020, 05:31.



        • #5
          Thanks Nick Cox for the detailed feedback on an alternative approach to using a box plot. I have replicated your example and it works perfectly.

          Thanks once again!!
