Help: Histogram with mean and standard deviation overlaid

Lynn Larose

Join Date: Feb 2017

Posts: 15
#1


27 Mar 2017, 07:07

Hello, I am using Stata 14.2

I would like a histogram of mean(intercepts) for my metabolite, with the overall mean and standard deviation overlaid.
Here are my data points:

. sum metabolite

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
  metabolite |     10,728    5.648804    4.412839          0   12.38498

I understand the hist command, and I have used the drop-down menu graphics -> histogram,
where I see an "add plots" option that includes an option for "median band-line".

I have searched the FAQ, previous posts, and also the help menu/manual.

I attach an example of a histogram with the overall mean and SD overlaid (created using SAS).

How do I replicate this using Stata?

Thank-you!!!!
Lynn

Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#2

27 Mar 2017, 07:18

I wonder whether this post gives the solution you wish.

Best regards,

Marcos
Lynn Larose

Join Date: Feb 2017

Posts: 15
#3

27 Mar 2017, 07:43

Thank you, Marcos. I will try it.
Best, Lynn
Lynn Larose

Join Date: Feb 2017

Posts: 15
#4

27 Mar 2017, 07:54

Hi Marcos,
Can you help me understand this line of code:

text(0.12 `m' `"mean = $`=string(`m',"%6.2f")'"', ///

When I run the code I get the following error message:
type mismatch
invalid point, mean = $ 0.12

Here is the full code for the example:
sysuse nlsw88, clear

summarize wage
local m=r(mean)
local sd=r(sd)
local low = `m'-`sd'
local high=`m'+`sd'

twoway histogram wage , ///
fc(none) lc(green) xline(`m') ///
xline(`low', lc(blue)) xline(`high', lc(blue)) scale(0.5) ///
text(0.12 `m' `"mean = $`=string(`m',"%6.2f")'"', ///
color(red) orientation(vertical) placement(2))

Thank-you, Lynn
Nick Cox

Join Date: Mar 2014

Posts: 35696
#5

27 Mar 2017, 08:06

Works for me. Make sure you run the code as a block, not e.g. line by line from a do-file editor.
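
A short sketch of why running line by line fails, and of what the text() line does. This is my reading of the error rather than anything confirmed in the thread: local macros only exist within the run that defines them, so if the twoway command is executed separately, the local m is empty. text() expects a y coordinate, an x coordinate, and then the label, so with m empty the label lands where the x coordinate should be, which would explain "invalid point". The display commands below are only there to show what the pieces expand to.

```stata
sysuse nlsw88, clear
summarize wage
local m = r(mean)

* An =exp expression inside macro quotes is evaluated while the command
* line is expanded, so the label is a finished string before text() sees it.
display `"mean = `=string(`m', "%6.2f")'"'

* text() reads: y-coordinate, x-coordinate, "label".
* Here y is the constant 0.12 and x is the mean stored in the local m.
display "y = 0.12, x = `m'"
```

Run all of these lines together. If only the display lines are selected and run on their own, the local m is undefined and the x coordinate prints as empty.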
Lynn Larose

Join Date: Feb 2017

Posts: 15
#6

27 Mar 2017, 08:40

Yes, thank you. It works when run as a block; I had run it line by line.
One more question: the mean and SD lines appear vertically (attached as Stata vertical).
Is it possible to have the mean and SD appear horizontally at the base of the histogram (attached as SAS horizontal)?


Nick Cox

Join Date: Mar 2014
Posts: 35696
#7

27 Mar 2017, 08:52

This shows some technique:

Code:

sysuse nlsw88, clear

summarize wage
gen low = r(mean) - r(sd) 
gen high = r(mean) + r(sd) 
gen where = -0.005

twoway histogram wage , ///
fc(none) lc(green) xtitle("`: var label wage'") ytitle(density) /// 
|| rbar low high where, horiz barw(0.005) legend(off)


Lynn Larose

Join Date: Feb 2017

Posts: 15
#8

27 Mar 2017, 09:12

Thanks very much for resolving this challenge. With gratitude, Lynn
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#9

27 Mar 2017, 09:19

These "mixed" graphs (shown in #1 as taken from SAS) are also frequently found in R.

The commands in #7 are quite an achievement! Surely they will be saved and used by Stata users.

That said, and just as a side note: the histogram in #6 points to a (rather) negatively skewed variable. Moreover, it seems the "log2+1" transformation didn't help much to "normalize" it.

Medians and IQRs tend to perform better in such a scenario. Besides, the "natural" variable can be kept in its original form. That being the case, box plots clearly outperform histograms. Finally, the mean could be marked as well, should a "mixed" graph be chosen.
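
One possible sketch of such a "mixed" graph, adapting the rbar approach from #7 so that the bar under the histogram spans the IQR and the median and mean are marked on it. It uses the nlsw88 example data; the variable names p25, p75, med and avg, the marker symbols, and the offset -0.005 are illustrative choices, not taken from the posts above.

```stata
sysuse nlsw88, clear

* detail is needed to get the percentiles in r()
summarize wage, detail
generate p25   = r(p25)
generate p75   = r(p75)
generate med   = r(p50)
generate avg   = r(mean)
generate where = -0.005          // just below the axis at y = 0

* Plot 1: histogram; plot 2: IQR bar; plots 3 and 4: median and mean markers
twoway histogram wage, fc(none) lc(green) ///
    || rbar p25 p75 where, horizontal barwidth(0.005) ///
    || scatter where med in 1, msymbol(X) mcolor(black) ///
    || scatter where avg in 1, msymbol(D) mcolor(red) ///
    legend(order(2 "IQR" 3 "median" 4 "mean"))
```

A plain box plot (graph hbox wage) shows the median and quartiles too, but it does not mark the mean by default.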

Last edited by Marcos Almeida; 27 Mar 2017, 09:21.

Best regards,

Marcos
Lynn Larose

Join Date: Feb 2017

Posts: 15
#10

27 Mar 2017, 09:27

Hi Marcos,
Thanks for this. I appreciate your input. My metabolite is cotinine and therefore varies by smoking status, which is why I do not have a normal distribution. I will also produce some box plots as you suggest; I suspect box plots may perform better. Indeed, I am working between Stata and R, although R graphics are proving to have a bit of a learning curve and, as always, the output from these results is urgent.
Best wishes, Lynn

Manos Magklis

Join Date: Sep 2018
Posts: 1

#11

12 Sep 2018, 10:55

Thanks a lot to Nick for the inspiration - I must say I've already learnt a lot from him.
This is my first Statalist post, so apologies for any mistakes in the way I quote/post code or reply.

I thought I'd share the following bit of code, which I built on Nick's code.
I had the idea of writing code that would create a histogram with all the important points (means, medians, and different ranges such as (y-sd, y+sd) or (p25, p75)) that you are usually interested in when exploring a dataset. So I wrote the code below, in case anyone finds it useful.

Cheers everyone!

Code:

set scheme vg_rose
webuse grunfeld, clear

gen m_sd=.
gen msd=.
gen m_2sd=.
gen m2sd=.
gen m_3sd=.
gen m3sd=.
gen where=.
gen where2=.
gen per25=.
gen per75=.
gen where3=.
foreach var of varlist mvalue kstock {
    sum `var', d
    local mean=r(mean)
    local median=r(p50)
    local p25=r(p25)
    local p75=r(p75)
    local p10=r(p10)
    local p90=r(p90)
    local max=r(max)
    local min=r(min)
    replace m_sd= r(mean) - r(sd)
    replace msd = r(mean) + r(sd)
    replace m_2sd = r(mean) - 2*r(sd)
    replace m2sd= r(mean) + 2*r(sd)
    replace m_3sd = r(mean) - 3*r(sd)
    replace m3sd= r(mean) + 3*r(sd)
    replace where = -0.3
    replace where2 = -0.6
    replace where3=-0.9
    replace per25=r(p25)
    replace per75=r(p75)
    twoway (hist `var', percent xaxis(1 2) fcolor(grey%30) lcolor(grey%1) bin(50) xtitle(`var') ytitle(Percent) ///
        xline(`p10' `p90', lwidth(0.2) lpattern(dash) noextend) ///
        xline(`p25' `p75', lwidth(0.2) lcolor(orange%50) noextend) ///
        xline(`mean', lwidth(0.5) lcolor(black%60) noextend) ///
        xline(`median', lwidth(0.5) lcolor(black%80) noextend) ///
        xlabel(`p10' "p10" `p25' "p25" `p75' "p75" `p90' "p90" `mean' "mean" `median' "median", axis(2) labcolor(black) labsize(vsmall) angle(65) alternate ) ///
        xlabel(`p10' `p25' `mean' `median' `p75' `p90', format(%9.2f) axis(1) labsize(2.5) angle(65) alternate ) ///
        xscale(noline axis(2))) || ///
        rbar m_sd msd where , horiz barw(0.2) legend(off) || ///
        rbar m_2sd m2sd where2, horiz barw(0.2) legend(off) || ///
        rbar m_3sd m3sd where3, horiz barw(0.2) legend(off)
}


Mohsin Javed

Join Date: Jun 2019

Posts: 55
#12

23 Sep 2019, 17:50

Nick:

What does where= -0.005 mean in post # 7?

What does it represent?
Nick Cox

Join Date: Mar 2014

Posts: 35696
#13

23 Sep 2019, 18:04

The variable where gives the vertical position of a bar underneath the histogram. Its value is constant — because the bar is horizontal; negative — because the bar is to go below the horizontal axis, which is at y = 0; and small — because it is to go just below the axis.

The precise value depends on the range on the y axis, which here shows probability density. Depending on that range you might need a different value.

The code is deliberately reproducible, assuming only a standard Stata installation, so that you can run it to see what happens.

Last edited by Nick Cox; 23 Sep 2019, 18:11.
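
A sketch of one way to avoid picking the offset by hand: tie it to the height of the tallest histogram bar. This assumes that twoway__histogram_gen, the helper command that computes histogram bar heights, uses the same default bins as twoway histogram; the 4% factor and the local name step are arbitrary illustrative choices.

```stata
sysuse nlsw88, clear

* Mean +/- SD of the plotted variable (computed first, before r() is overwritten)
summarize wage
generate low  = r(mean) - r(sd)
generate high = r(mean) + r(sd)

* Store the density bar heights in h and the bin midpoints in x
twoway__histogram_gen wage, density generate(h x)
summarize h, meanonly
local step = 0.04 * r(max)       // 4% of the tallest bar
generate where = -`step'         // constant, negative, small

twoway histogram wage, fc(none) lc(green) ///
    || rbar low high where, horizontal barwidth(`step') legend(off)
```

Because step is a local macro, the whole block has to be run together. For a histogram on another scale (percent, say) the density option would need to change to match.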
Pandelis Andreou

Join Date: Dec 2018

Posts: 22
#14

24 Feb 2023, 15:59

I would like to create this type of graph, for a specific variable (e.g. salary) where:

a) the vertical lines mark the mean, the median and a specific value (set 'manually' by me).

b) Those are values to be labeled in the graph as: 'mean', 'median' and 'minimum value'.

I tried to create these from the commands suggested above, without any success....
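
For what it is worth, a minimal sketch of the kind of graph described in (a) and (b), combining xline() and text() as in #4. It uses wage from nlsw88 as a stand-in for salary; the cutoff of 3, the colours, and the label heights 0.12, 0.10 and 0.08 are hypothetical values that would need adjusting to the real variable and its density range.

```stata
sysuse nlsw88, clear

summarize wage, detail
local mean   = r(mean)
local median = r(p50)
local cutoff = 3                 // hypothetical value set by hand

* One xline() per reference value, one text() per label.
* text() takes the y coordinate first, then the x coordinate, then the label.
twoway histogram wage, fc(none) lc(green) ///
    xline(`mean', lcolor(red)) ///
    xline(`median', lcolor(blue)) ///
    xline(`cutoff', lcolor(black) lpattern(dash)) ///
    text(0.12 `mean' "mean", color(red) placement(e)) ///
    text(0.10 `median' "median", color(blue) placement(e)) ///
    text(0.08 `cutoff' "minimum value", color(black) placement(e))
```

As noted in #5, the locals are only defined if everything is run as one block. The labels sit at different heights so that they do not overlap when the mean and median are close together.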