  • Box-whisker-plot in Stata (default): What is the Whisker? 67% of what? not 1.5 inter quartile range (IQR)?

    Stata (Version 15) help for "graph box" says
    alsize(#) width of adjacent line; default is 67
    and in more detail:

    alsize(#) ... specify the width of the adjacent line... You may specify these options whether or not you
    specify cwhiskers. alsize() ... specified in percentage-of-box-width units; the defaults are alsize(67). Thus the adjacent lines extend
    two-thirds the width of a box ...
    (Omitting the text on the caps.)

    However, Nick Cox's "Speaking Stata: Creating and varying box plots", The Stata Journal (2009) 9, Number 3, pp. 478–496, says:
    graph box and graph hbox by default follow what is perhaps the most common recipe (Tukey 1977):
    1. Lines, often called whiskers, are drawn to span all data points within 1.5 IQR of the nearer quartile. That is, one whisker extends to include all data points within 1.5 IQR of the upper quartile and stops at the largest such value, while the other whisker extends to include all data within 1.5 IQR of the lower quartile and stops at the smallest such value. Tukey called the outer limits of the whiskers adjacent values.
    However, I do not know whether the defaults in Stata have changed since 2009. Can anyone enlighten me, please?

  • #2
    The "adjacent line" is the horizontal line at the end of a vertical whisker, and the size set by the alsize() option is how wide the "adjacent line" is relative to the width of the box from which the whisker springs.

    The alsize() option does not control the length of the whiskers, as I think you misunderstood, and what Nick Cox wrote in 2009 regarding the length of the whiskers remains correct.

    From the output of help graph box (coloring added)
    Code:
                                      o     <- outside values
                                      o
    
               adjacent line  --+     -     <- upper adjacent value
                                |     |
                      whiskers  |     |
                               -|   +---+   <- 75th percentile (upper hinge)
                                |   |   |
                         box    |   |---|   <- median
                                |   |   |
                               -|   +---+   <- 25th percentile (lower hinge)
                      whiskers  |     |
                                |     |
               adjacent line  --+     -     <- lower adjacent value
    
                                      o     <- outside value
    Last edited by William Lisowski; 16 Dec 2020, 09:26.
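A quick way to see the point for yourself: the sketch below draws the same box plot twice, changing only alsize(). The graph names al20 and al100 and the titles are arbitrary choices for illustration. If the explanation above is right, only the width of the horizontal adjacent lines should differ between the two panels, while the whiskers end at the same values.

```stata
* Same data, same whiskers; only the width of the adjacent line changes
sysuse auto, clear
graph box price, alsize(20)  title("alsize(20)")  name(al20, replace)
graph box price, alsize(100) title("alsize(100)") name(al100, replace)

* ycommon puts both panels on one y scale so whisker ends can be compared
graph combine al20 al100, ycommon
```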


    • #3
      There is no contradiction here. alsize() is solely about how a line is rendered and nothing at all to do with where it is placed given a magnitude axis.

      Not the question, but it is 43 years at least since the 1.5 IQR rule of thumb was suggested by Tukey (most famously included in Exploratory data analysis, 1977) as a convention for which data points, if any, are shown individually, and which others, if any, are implied by whiskers.

      Tukey's context was explicitly methods that could be implemented by hand -- meaning, literally, with graph paper and coloured pens or pencils -- and with minimal arithmetic -- for small or moderate size datasets. If that's anybody's practical context, the advice might still seem compelling, but now

      1. We aren't limited to such methods!

      2. There is, I suggest, much practical experience that

      a. The Tukey rules have often proved harder to explain, memorise and understand than Tukey might have hoped.

      b. It's often a good idea to show much more detail, not only because you can but also it usually helps and almost never hinders.

      I tend to draw boxes alongside more detailed representations of the data -- using stripplot from SSC -- and to draw whiskers either to the extremes or to selected percentiles, depending on the size of the dataset and what seems to work best, given whatever mix of science, statistics, and suspicion underlies the analysis.
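A minimal sketch of that kind of display, assuming stripplot has been installed from SSC. The option names box, pctile() and vertical are written as I recall them from the stripplot help, so please check help stripplot before relying on them. Here pctile(5) is meant to draw whiskers to the 5th and 95th percentiles instead of the Tukey adjacent values.

```stata
* stripplot is community-contributed; install it once from SSC
ssc install stripplot

sysuse auto, clear

* Individual data points plus a box, with whiskers to the 5th and 95th percentiles
stripplot price, box pctile(5) vertical
```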


      • #4
        The option alsize() is irrelevant to what you want to know: it influences the way the line is drawn, not how long it is.

        If you want to know such things, look at the manual (the PDF file to which the help file links) and go to the section Methods and formulas. This entry shows that Stata uses the description given by Nick (which goes back to Tukey). This definition has been in place at least since Stata 11 (the first version of Stata that shipped with the manuals as PDFs), and presumably since box graphs were first included in Stata, as this is the standard definition of those whiskers.
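The rule can also be checked by hand. This is only a sketch: it assumes that the quartiles reported by summarize, detail are the ones graph box uses, and the local macro names are mine. The adjacent values are the most extreme observations that still lie within 1.5 IQR of the nearer quartile.

```stata
sysuse auto, clear

* Quartiles and interquartile range
quietly summarize price, detail
local q1  = r(p25)
local q3  = r(p75)
local iqr = `q3' - `q1'

* Fences at 1.5 IQR beyond each quartile
local lo_fence = `q1' - 1.5*`iqr'
local up_fence = `q3' + 1.5*`iqr'

* Adjacent values: smallest and largest observations inside the fences
quietly summarize price if price >= `lo_fence' & price <= `up_fence'
local lo_adj = r(min)
local up_adj = r(max)

display "lower adjacent value = `lo_adj'"
display "upper adjacent value = `up_adj'"

* Observations beyond the adjacent values would be plotted as outside values
count if (price < `lo_adj' | price > `up_adj') & !missing(price)
```

The whiskers in graph box price should end at the two displayed values, whatever alsize() is set to.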
        ---------------------------------
        Maarten L. Buis
        University of Konstanz
        Department of history and sociology
        box 40
        78457 Konstanz
        Germany
        http://www.maartenbuis.nl
        ---------------------------------


        • #5
          Originally posted by William Lisowski
          The "adjacent line" is the horizontal line at the end of a vertical whisker, and the size set by the alsize() option is how wide the "adjacent line" is relative to the width of the box from which the whisker springs.

          The alsize() option does not control the length of the whiskers, as I think you misunderstood, and what Nick Cox wrote in 2009 regarding the length of the whiskers remains correct.

          From the output of help graph box (coloring added)
          Code:
                                            o     <- outside values
                                            o

                     adjacent line  --+     -     <- upper adjacent value
                                      |     |
                            whiskers  |     |
                                     -|   +---+   <- 75th percentile (upper hinge)
                                      |   |   |
                               box    |   |---|   <- median
                                      |   |   |
                                     -|   +---+   <- 25th percentile (lower hinge)
                            whiskers  |     |
                                      |     |
                     adjacent line  --+     -     <- lower adjacent value

                                            o     <- outside value

          Thanks. Yes, I was confused by that. I should have tried with extreme values; I don't know why I didn't. Thank you.

          Would have been awesome if the whisker definition were to be found anywhere in the help :D


          • #6
            Originally posted by Maarten Buis
            The option alsize() is irrelevant to what you want to know: it influences the way the line is drawn, not how long it is.

            If you want to know such things, look at the manual (the PDF file to which the help file links) and go to the section Methods and formulas. This entry shows that Stata uses the description given by Nick (which goes back to Tukey). This definition has been in place at least since Stata 11 (the first version of Stata that shipped with the manuals as PDFs), and presumably since box graphs were first included in Stata, as this is the standard definition of those whiskers.
            Oh, I had no idea that the PDF gives more help - I always assumed it has the same as the help you get when you type h(elp) something.

            Thanks!


            • #7
              Oh, I had no idea that the PDF gives more help - I always assumed it has the same as the help you get when you type h(elp) something.
              In the output of help for many commands, you will see a section much like this one, copied from the output of help graph box. In Stata the three entries are active links to open the PDF documentation to the given section.
              Code:
              Links to PDF documentation
              
                      Quick start
              
                      Remarks and examples
              
                      Methods and formulas
              
                  The above sections are not included in this help file.


              • #8
                Originally posted by Andrea Maier

                Oh, I had no idea that the PDF gives more help - I always assumed it has the same as the help you get when you type h(elp) something.
                Oh, you are in for a treat! There is so much more useful information available than you thought. That should keep you reading during the coming holiday (and for many years to come ...).


                • #9
                  Some years ago, being unaware of Nick Cox's 2004 program adjacent, I wrote sumadj, which adds the adjacent values to the output of summarize. But for expenditure or price data, which are typically positively skewed, box-and-whisker plots can be unsatisfying because they hide the detail in the lower end of the distribution. To respond to this concern, the sumadj program with the options graph ylog produces a box-and-whisker plot of the logarithmic transform of the variable. The alpha option replaces the numeric labels with words to identify the various "whiskers" about which I have received so many questions from audiences. These commands:

                  Code:
                  net install sumadj, from("http://digital.cgdev.org/doc/stata/MO/Misc")
                  sysuse auto
                  sumadj price, graph name(plain)
                  sumadj price, graph ylog alpha name(ylog)
                  gr combine plain ylog
                  produce this pair of annotated box-and-whisker plots.



                  Ben Jann has recently published a program called robbox (for "robust box"), which provides new tools for constructing and displaying box-and-whisker plots. While robbox does not have its own options to display the vertical axis on a log scale or to label the potentially mysterious whiskers, the same functionality can be achieved by executing robbox twice as follows:

                  Code:
                  ssc install robbox
                  sysuse auto, clear

                  * First pass: standard box plot; pick up the plotted statistics from e()
                  robbox price, standard name(plain, replace)
                  local median = e(b)[1,1]
                  local lo_q = e(box)[1,1]
                  local up_q = e(box)[2,1]
                  local lo_w = e(whiskers)[1,1]
                  local up_w = e(whiskers)[2,1]

                  * Second pass: log-scaled axis, labelled in words at those statistics
                  robbox price, standard yscale(log)  ///
                      ylabel(  ///
                          `lo_w' "Lower Adj."  ///
                          `lo_q' "25th %ile"   ///
                          `median' "Median"    ///
                          `up_q' "75th %ile"   ///
                          `up_w' "Upper Adj."  ///
                      , angle(hor) )  ///
                      ytitle("log(Price)")  ///
                      name(robbox_log, replace)

                  * Show the plain and the log-scaled plots side by side
                  gr combine plain robbox_log
                  The left panel of the following combined graph shows robbox's plot of the price variable from auto.dta, which is comparable to Stata's default box-and-whisker plot. The right panel displays the same variable but applies Stata's yscale(log) and ylabel() options to achieve a cleaner version of the right panel from sumadj above.

                  [Image: robbox_demo.png]


                  • #10
                    Obviously I haven't mastered posting image files. Please disregard the duplicate posting of the sumadj graph and instead click on the thumbnail which does indeed show the two robbox graphs.


                    • #11
                      Thank you so much! I have no idea why I didn't see this earlier, but it is very important. Thank you!


                      • #12
                        You will get email notification of posts to a thread if and only if you subscribe to that thread.


                        • #13
                          Oh, I thought that was the default when you author a thread. Thanks for the clarification, Nick Cox!
