  • How to create a box-and-whisker plot that includes outliers as well

    Dear all users, I am dealing with the following problem:
    I have to graph the distribution of my outcome variable within subgroups defined by another variable (10 subgroups), so a box-and-whisker plot seemed perfect to me!
    The problem is that my outcome variable ranges from 0 to 11, and only a few records (but still a considerable number) have the larger values, so when I graph it Stata flags them as outliers. Since they are not outliers, I need them to be treated as the highest values of the range. How can I do that?
    The code I use is:
    graph box var, over(independent_var). I know that to get rid of outliers I should use the option "nooutsides", but on the contrary I need to include them all as part of the graph. Any idea how to fix it?

    Thank you so much in advance!

  • #2
    This is rather difficult to follow, but I take it to mean that you don't want the usual rule for calculating whiskers, whereby data are plotted as points if they fall outside (lower quartile - 1.5 IQR, upper quartile + 1.5 IQR), but one that includes all the data regardless.
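
    To see how many values that rule sets aside in each subgroup, something like the following may help. It is only a sketch: it assumes the auto data with mpg and rep78 standing in for the outcome and grouping variables, the names p25, p75 and outside are made up for illustration, and the quartiles from egen may differ slightly from those graph box uses.

    ```stata
    sysuse auto, clear
    * quartiles of the outcome within each subgroup
    egen p25 = pctile(mpg), p(25) by(rep78)
    egen p75 = pctile(mpg), p(75) by(rep78)
    * flag values beyond quartile -/+ 1.5 IQR, i.e. those plotted as points
    gen byte outside = (mpg < p25 - 1.5 * (p75 - p25)) | ///
        (mpg > p75 + 1.5 * (p75 - p25)) if !missing(mpg, rep78)
    * number of flagged values per subgroup
    tabulate rep78 outside
    ```

    The /// continuation works in a do-file; typed interactively, the gen command needs to go on one line.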

    I don't think graph box can help you and omitting data is certainly a bad idea.

    I have two alternatives.

    1. Use dotplot. You can show boxes as bars, and get richer detail.

    Here is an example:

    Code:
    sysuse auto, clear
    dotplot mpg, over(rep78) bar



    2. Use stripplot from SSC. stripplot supports whiskers defined by paired quantiles; you need to specify a small probability to get them covering the entire range.

    Code:
    * stripplot is user-written; install it once with: ssc install stripplot
    stripplot mpg, over(rep78) box(barw(0.1)) pct(0.1) boffset(-0.15) vertical stack height(0.4)
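
    A further possibility, not needed if either command above suits, is to assemble the box plot from twoway pieces with whiskers drawn to the minimum and maximum, so that no value is shown as outside. This is a rough sketch on the auto data; the variable names med, loq, upq, lo and hi are made up for illustration, and the bar width and titles are arbitrary choices.

    ```stata
    sysuse auto, clear
    * summaries of mpg within each category of rep78
    egen med = median(mpg), by(rep78)
    egen loq = pctile(mpg), p(25) by(rep78)
    egen upq = pctile(mpg), p(75) by(rep78)
    egen lo  = min(mpg), by(rep78)
    egen hi  = max(mpg), by(rep78)
    * two bars meeting at the median form the box; spikes reach the extremes
    twoway rbar med upq rep78, barwidth(0.4) fcolor(none)    ///
        || rbar med loq rep78, barwidth(0.4) fcolor(none)    ///
        || rspike upq hi rep78                               ///
        || rspike loq lo rep78,                              ///
        legend(off) ytitle("Mileage (mpg)")
    ```

    The cost of this approach is that the individual data points are no longer shown, which the two commands above do show.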

    Last edited by Nick Cox; 01 May 2014, 05:22.



    • #3
      [I was writing this while Nick posted his very helpful answer to your question]

      Before you try to create variations of standard boxplots (there are variations; I recommend having a look at Wikipedia, although it is not the best explanation, and at the Stata manual entry "[G-2] graph box" via help graph box), you should know how the box, the whiskers, and the outliers or extremes are usually defined. Next, have a look at

      Cox, N. (2009). Speaking Stata: Creating and varying box plots. The Stata Journal, 9(3), 478-496.

      Perhaps you missed the Statalist FAQ (#6) about using real names, first and last. To change your user name, contact the forum administrator (contact link at the bottom of the page).
      Last edited by Dirk Enzmann; 01 May 2014, 05:37.



      • #4
        Thanks for the mention, but readers should note also the correction at http://www.stata-journal.com/article...ticle=gr0039_1



        • #5
          How discrete is your outcome variable? If, between 0 and 11, it takes only integer values, the discreteness may cause boxplots to behave a little differently than if the variable were "continuous."

          A little correction on terminology. The data values plotted as individual points at the ends of a standard boxplot are "outside," but not necessarily outliers. In samples of well-behaved data, "outside" values are more frequent than the term "outlier" implies. The aim is to focus attention on those observations and invite the analyst to investigate them. The results of that investigation may justify calling some or all of those observations "outliers."

          If the sample sizes in the 10 subgroups are not too large, a dot plot (as Nick suggested) may be a good choice.

          If the sample sizes are quite large, pairwise quantile-quantile plots would compare the distributions in the subgroups.
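
          For one pair of subgroups, such a plot might be drawn along these lines. This is only a sketch on the auto data, with mpg and foreign standing in for the outcome and grouping variables; mpg0 and mpg1 are the names that separate generates from the values of foreign.

          ```stata
          sysuse auto, clear
          * split mpg into one variable per subgroup: mpg0 and mpg1
          separate mpg, by(foreign)
          * quantiles of one subgroup plotted against those of the other
          qqplot mpg0 mpg1
          ```

          With 10 subgroups there are 45 possible pairs, so in practice one would probably compare each subgroup with a single reference subgroup.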

          What aspects of the distribution of your outcome variable are most important?



          • #6
            Originally posted by David Hoaglin
            The data values plotted as individual points at the ends of a standard boxplot are "outside," but not necessarily outliers. In samples of well-behaved data, "outside" values are more frequent than the term "outlier" implies.
            Sure, thanks for correcting the terminology used.



            • #7
              I just downloaded stripplot from SSC and ran the code as presented in your example, Nick Cox, but all my data points, which are evenly spread out using dotplot (as in your example), lie on top of each other using stripplot (unlike what you show in your example). I wrote:

              Code:
              stripplot var1, over(group) box
              where group is 0 or 1.

              Best
              Lars



              • #8
                Note to self: it's all in the help files.
                The problem was the level of detail in my var1: all values were unique. So by using:
                Code:
                stripplot var1, over(group) vertical stack width(0.05)
                it worked out.

                Remember to read the help files. More than once.



                • #9
                  Thanks for your self-correcting query. If the allusion was to #2 then I note that stack was explicit as an option.

