
  • Help: Histogram with mean and standard deviation overlaid

    Hello, I am using Stata 14.2

    I would like a histogram of mean(intercepts) for my metabolite, with the overall mean and standard deviation overlaid.
    Here are my data points:

    sum metabolite

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
      metabolite |     10,728    5.648804    4.412839          0   12.38498

    I understand the hist command, and I have used the drop-down menu Graphics > Histogram,
    where I see an "add plots" option that includes an option for "median band-line".

    I have searched the FAQ, previous posts, and the help menu/manual.

    I attach an example of a histogram with the overall mean and SD overlaid (created using SAS).

    How do I replicate this using Stata?

    Thank you!
    Lynn




  • #2
    I wonder whether this post gives the solution you wish.
    Best regards,

    Marcos


    • #3
      Thank you, Marcos. I will try it.
      Best, Lynn


      • #4
        Hi Marcos,
        Can you help me understand this line of code:

        text(0.12 `m' `"mean = $`=string(`m',"%6.2f")'"', ///

        When I run the code I get the following error message:
        type mismatch
        invalid point, mean = $ 0.12


        Here is the full code for the example:
        sysuse nlsw88, clear

        summarize wage
        local m=r(mean)
        local sd=r(sd)
        local low = `m'-`sd'
        local high=`m'+`sd'

        twoway histogram wage , ///
        fc(none) lc(green) xline(`m') ///
        xline(`low', lc(blue)) xline(`high', lc(blue)) scale(0.5) ///
        text(0.12 `m' `"mean = $`=string(`m',"%6.2f")'"', ///
        color(red) orientation(vertical) placement(2))



        Thank-you, Lynn


        • #5
          Works for me. Make sure you run the code as a block, not e.g. line by line from a do-file editor.
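
          Some background on why this matters: local macros such as `m' exist only within the do-file (or selection) that defines them. Each selection sent from the Do-file Editor runs as its own temporary do-file, so when the twoway command is run separately, `m' is empty, text() no longer receives an x coordinate, and Stata reports an invalid point. A minimal sketch of one workaround, assuming you prefer to run commands one at a time: store the results in scalars, which persist for the session, and expand them inline. The scalar names m and sd are illustrative choices, not part of the original code.

          ```stata
          sysuse nlsw88, clear

          * scalars persist across separate runs, unlike local macros
          quietly summarize wage
          scalar m  = r(mean)
          scalar sd = r(sd)

          * `=exp' evaluates the expression and substitutes the result
          twoway histogram wage, fc(none) lc(green) ///
              xline(`=scalar(m)') ///
              xline(`=scalar(m) - scalar(sd)' `=scalar(m) + scalar(sd)', lc(blue))
          ```

          The graph command itself still has to be run in one piece, because its lines are joined by ///.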


          • #6
            Yes, thank you. The code works when run as a block; I had been running it line by line.
            One more question: the mean and SD lines appear vertically (attached as Stata vertical).
            Is it possible to have the mean and SD appear horizontally at the base of the histogram (attached as SAS horizontal)?


            • #7
              This shows some technique:


              Code:
              sysuse nlsw88, clear
              
              summarize wage
              gen low = r(mean) - r(sd) 
              gen high = r(mean) + r(sd) 
              gen where = -0.005
              
              twoway histogram wage , ///
              fc(none) lc(green) xtitle("`: var label wage'") ytitle(density) /// 
              || rbar low high where, horiz barw(0.005) legend(off)


              • #8
                Thanks very much for resolving this challenge. With gratitude, Lynn


                • #9
                  These "mixed" graphs (shown in #1 as taken from SAS) are also frequently found in R.

                  The commands in #7 are quite an achievement! Surely they are to be saved and used accordingly by Stata users.

                  That said, and just as a side note: the histogram in #6 points to a (rather) negatively skewed variable. Moreover, it seems the "log2+1" transformation didn't help much to "normalize" it.

                  Medians and IQRs tend to perform better in such a scenario. Besides, the "natural" variable can be kept in its pristine condition. That being the case, boxplots clearly outperform histograms. Finally, the mean could be marked as well, should a "mixed" graph be chosen.
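
                  For anyone who wants to try the median/IQR suggestion, here is a minimal sketch. It assumes the nlsw88 example data used elsewhere in this thread, since the cotinine data are not available, and it is only one way to look at these summaries.

                  ```stata
                  sysuse nlsw88, clear

                  * numeric summary: mean and SD next to median and IQR
                  tabstat wage, statistics(mean sd median iqr)

                  * horizontal box plot of the untransformed variable
                  graph hbox wage
                  ```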
                  Last edited by Marcos Almeida; 27 Mar 2017, 09:21.
                  Best regards,

                  Marcos


                  • #10
                    Hi Marcos,
                    Thanks for this. I appreciate your input. My metabolite is cotinine, which varies by smoking status; that is why I do not have a normal distribution. I will also produce some box plots as you suggest; I suspect they may perform better. Indeed, I am working between Stata and R, although R graphics are proving to have a bit of a learning curve and, as always, the output from these results is urgent.
                    Best wishes, Lynn


                    • #11
                      Thanks a lot to Nick for the inspiration; I must say I've already learnt a lot from him.
                      This is my first Statalist post, so apologies for any mistakes in the way I quote or post code or reply.

                      I thought I'd share the following bit of code, which I built based on Nick's code.
                      I had the idea of writing code that would create a histogram marking all the important points (means, medians, and ranges such as (y-sd, y+sd) or (p25, p75)) that you are usually interested in when exploring a dataset. So I wrote the code below, in case anyone finds it useful.


                      Cheers everyone!

                      Code:
                      set scheme vg_rose
                      webuse grunfeld, clear
                      
                      gen m_sd=.
                      gen msd=.
                      gen m_2sd=.
                      gen m2sd=.
                      gen m_3sd=.
                      gen m3sd=.
                      gen where=.
                      gen where2=.
                      gen per25=.
                      gen per75=.
                      gen where3=.
                      foreach var of varlist mvalue kstock {
                          * detailed summary: leaves r(mean), r(sd) and the percentiles behind
                          sum `var', d
                          local mean=r(mean)
                          local median=r(p50)
                          local p25=r(p25)
                          local p75=r(p75)
                          local p10=r(p10)
                          local p90=r(p90)
                          local max=r(max)
                          local min=r(min)
                          * endpoints of the mean -/+ 1, 2 and 3 SD bars
                          replace m_sd= r(mean) - r(sd)
                          replace msd = r(mean) + r(sd)
                          replace m_2sd = r(mean) - 2*r(sd)
                          replace m2sd= r(mean) + 2*r(sd)
                          replace m_3sd = r(mean) - 3*r(sd)
                          replace m3sd= r(mean) + 3*r(sd)
                          * vertical positions of the three bars, just below y = 0 on the percent scale
                          replace where = -0.3
                          replace where2 = -0.6
                          replace where3=-0.9
                          replace per25=r(p25)
                          replace per75=r(p75)
                          * histogram with reference lines; axis 2 carries the text labels
                          * (the color%opacity syntax requires Stata 15 or later)
                          twoway (hist `var', percent xaxis(1 2) fcolor(gray%30) lcolor(gray%1) bin(50) xtitle(`var') ytitle(Percent) ///
                              xline(`p10' `p90', lwidth(0.2) lpattern(dash) noextend) ///
                              xline(`p25' `p75', lwidth(0.2) lcolor(orange%50) noextend) ///
                              xline(`mean', lwidth(0.5) lcolor(black%60) noextend) ///
                              xline(`median', lwidth(0.5) lcolor(black%80) noextend) ///
                              xlabel(`p10' "p10" `p25' "p25" `p75' "p75" `p90' "p90" `mean' "mean" `median' "median", axis(2) labcolor(black) labsize(vsmall) angle(65) alternate ) ///
                              xlabel(`p10' `p25' `mean' `median' `p75' `p90', format(%9.2f) axis(1) labsize(2.5) angle(65) alternate ) ///
                              xscale(noline axis(2))) || ///
                              rbar m_sd msd where , horiz barw(0.2) legend(off) || ///
                              rbar m_2sd m2sd where2, horiz barw(0.2) legend(off) || ///
                              rbar m_3sd m3sd where3, horiz barw(0.2) legend(off)
                      }
                      Originally posted by Nick Cox
                      This shows some technique:


                      Code:
                      sysuse nlsw88, clear
                      
                      summarize wage
                      gen low = r(mean) - r(sd)
                      gen high = r(mean) + r(sd)
                      gen where = -0.005
                      
                      twoway histogram wage , ///
                      fc(none) lc(green) xtitle("`: var label wage'") ytitle(density) ///
                      || rbar low high where, horiz barw(0.005) legend(off)


                      • #12
                        Nick:

                        What does where= -0.005 mean in post # 7?

                        What does it represent?


                        • #13

                          The variable where gives the vertical position of a bar underneath the histogram. Its value is constant — because the bar is horizontal; negative — because the bar is to go below the horizontal axis, which is at y = 0; and small — because it is to go just below the axis.

                          The precise value depends on the range on the y axis, which here shows probability density. Depending on that range you might need a different value.

                          The code is deliberately reproducible, assuming only a standard Stata installation, so that you can run it to see what happens.
                          Last edited by Nick Cox; 23 Sep 2019, 18:11.
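
                          As a possible way to avoid choosing that value by trial and error, the sketch below derives it from the tallest histogram bar. It assumes the same nlsw88 example and uses twoway__histogram_gen to obtain the bar heights that twoway histogram would draw; the 3% factor and the names h, x and step are arbitrary illustrative choices.

                          ```stata
                          sysuse nlsw88, clear

                          * bar endpoints: mean -/+ one SD
                          quietly summarize wage
                          gen low  = r(mean) - r(sd)
                          gen high = r(mean) + r(sd)

                          * bar heights (density scale by default) and bin midpoints
                          twoway__histogram_gen wage, generate(h x)
                          quietly summarize h, meanonly
                          local step = 0.03 * r(max)    // 3% of the tallest bar, an arbitrary choice

                          * constant, negative and small relative to the y range
                          gen where = -`step'

                          twoway histogram wage, fc(none) lc(green) ytitle(density) ///
                              || rbar low high where, horiz barw(`step') legend(off)
                          ```

                          Because the local macro step is used in the last command, the whole block needs to be run together.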


                          • #14
                            I would like to create this type of graph, for a specific variable (e.g. salary) where:

                            a) the vertical lines mark the mean, the median and a specific value (set 'manually' by me).

                            b) Those are values to be labeled in the graph as: 'mean', 'median' and 'minimum value'.


                            I tried to create these from the commands suggested above, without any success....
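
                            One possible starting point, offered as a rough sketch rather than a tested solution for the attached graph: it combines xline() and text() as in #4, on the nlsw88 example data. The value 3 stored in the local macro cut is a hypothetical stand-in for the manually chosen value, and the y positions (0.12, 0.10, 0.08) are guesses that would need adjusting to the density range of your own variable.

                            ```stata
                            sysuse nlsw88, clear

                            quietly summarize wage, detail
                            local mean   = r(mean)
                            local median = r(p50)
                            local cut    = 3      // hypothetical value set by hand

                            * one vertical line per value, each with a label to its right
                            twoway histogram wage, fc(none) lc(green) ///
                                xline(`mean', lc(red)) xline(`median', lc(blue)) xline(`cut', lc(black)) ///
                                text(0.12 `mean' "mean", placement(e) color(red)) ///
                                text(0.10 `median' "median", placement(e) color(blue)) ///
                                text(0.08 `cut' "minimum value", placement(e) color(black))
                            ```

                            As noted in #5, run this as a block so that the local macros are still defined when the graph command executes.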