This is a brief puff for an idea that has become standard in some quarters, but seems to deserve a bigger push until everyone who might care knows about it. Here is a reproducible example, which as always is indicative, not definitive.
Code:
sysuse auto, clear

* grid of evaluation points 5, 6, ..., 49, wider than the observed range of mpg
gen where = _n + 4 in 1/45

* same kernel, bandwidth and grid for both groups
local choices kernel(biweight) bw(5) at(where)
kdensity mpg if foreign, `choices' gen(x1 d1)
kdensity mpg if !foreign, `choices' gen(x0 d0)

* vertical positions just below zero for the rug of data points
gen rug1 = -0.004
gen rug0 = -0.008

* overlaid transparent areas, with each group's rug in the matching colour
twoway area d1 d0 where, xtitle("`: var label mpg'") color(orange%40 blue%40) ///
|| scatter rug1 mpg if foreign, ms(|) mc(orange) msize(medlarge) ///
|| scatter rug0 mpg if !foreign, ms(|) mc(blue) msize(medlarge) ///
legend(order(1 "Foreign" 2 "Domestic") pos(1) ring(0) col(1)) ///
ytitle(Probability density) yla(, ang(h)) xla(10(10)40)
Kernel density estimates are plotted by default in Stata as lines, meaning curves. It is elementary (meaning, fundamental) that area under the curve has an interpretation as probability.
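As a rough check on that interpretation, here is a minimal sketch that numerically integrates each estimated density over the grid. It assumes the variables d1, d0 and where created by the code above are still in memory. The use of integ is my addition, not part of the original example. If the grid covers the support of each estimate, the reported integrals should be close to 1.

```stata
* numerical integration of each density estimate over the grid
* (assumes d1, d0 and where exist from the code above)
integ d1 where
integ d0 where
```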
Often area-based graphs say in a complicated way what could be said much more simply. Bad examples include bars with arbitrary bases that could just be replaced by point symbols for the values in question, or bars that start at zero, when not being zero is banal or irrelevant.
However, area graphs can be helpful when comparing two or more distributions. (Histograms work that way.) But then transparency becomes vital to see overlap clearly.
You can do something like this directly with kdensity or twoway kdensity with the option recast(area). There is no special rationale for coding as above, although the default of truncating the density at the observed extremes can be unfortunate, so I typically work a little harder at setting up a wider grid on which to calculate estimates.
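A minimal sketch of that more direct route follows. It is not a replacement for the code above, and it omits the rugs. The range(5 49) setting is my assumption, chosen only to mirror the wider grid used earlier so that the estimates are not cut off at the observed extremes. The kernel and bandwidth are copied from the example.

```stata
sysuse auto, clear

* overlay two kernel density estimates, each recast from a line to an area
* range(5 49) is an assumed choice, matching the grid 5, ..., 49 used above
twoway kdensity mpg if foreign, kernel(biweight) bwidth(5) range(5 49) ///
    recast(area) color(orange%40) ///
|| kdensity mpg if !foreign, kernel(biweight) bwidth(5) range(5 49) ///
    recast(area) color(blue%40) ///
legend(order(1 "Foreign" 2 "Domestic") pos(1) ring(0) col(1)) ///
xtitle("`: var label mpg'") ytitle(Probability density) yla(, ang(h))
```

As with the original code, colour transparency such as orange%40 needs Stata 15 or later.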
The immediate inspiration for this came from an excellent book by Claus Wilke. This is a link to a review I wrote with several detailed comments: https://www.amazon.com/gp/customer-reviews/R22MWD7RJ6QAFP