With repeated thanks to Kit Baum, a new package multidensity is available from SSC. It is billed as requiring Stata 8, but in truth I have not tested it against Stata 8, which has long since been inaccessible to me. Conversely, I will flag that some of the examples in the help won't work unless you are using Stata 15 or later.
The focus of this package -- just containing a single command with the same name -- is kernel density estimation, long since supported by official commands, but its selling point is as a convenience wrapper. The help file gives details as usual, so the purpose of this post is to give examples, so that you can quickly see whether the command might be interesting or useful to you.
To be different, let's look at the auto data and first focus on the variable price and make up for StataCorp's omission of units of measurement. (The spirits of my high school science teachers would all be in anguish otherwise.)
Code:
set scheme s1color
sysuse auto, clear
label var price "Price (USD)"
multidensity uses subcommands, the most useful being generate (which as usual can be abbreviated all the way down to g, except that most users seem to find gen most congenial).
In this example, we ask for density to be estimated over the range 0 to 18000 (USD) separately by foreign, using the default kernel (Epanechnikov) and whatever default bandwidths kdensity chooses. That leaves four variables in memory: two density variables and two defining grids for the variable(s) in question. Given those extra variables, we can ask for graphs. First we ask for superimposed graphs, all the results in one panel. The three examples range from plain to fancy to fancier: plain lines, recast areas (transparency here requiring Stata 15 up) and then, with some more work, rugs on the bottom showing the distinct values in each case (not "unique", please).
Code:
multidensity gen price, by(foreign) min(0) max(18000)
multidensity super, name(G1, replace)
multidensity super, recast(area) opt1(lcolor(orange) color(orange%40)) opt2(lcolor(blue) color(blue%40)) title("Price (USD)") name(G2, replace)
su _density1, meanonly
local max = r(max)
su _density2, meanonly
local max = max(`max', r(max))
gen where1 = -`max'/15
gen where0 = -`max'/30
local rugcode addplot(scatter where0 price if !foreign, ms(|) mc(orange) || scatter where1 price if foreign, ms(|) mc(blue))
multidensity super, recast(area) opt1(lcolor(orange) color(orange%40)) opt2(lcolor(blue) color(blue%40)) title("Price (USD)") ytitle(Density) `rugcode' name(G3, replace)
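For comparison, here is a minimal sketch of roughly the same superimposed display using official commands only. It illustrates the typing the wrapper saves and is not a statement of how multidensity works internally; the legend text is an assumption.
Code:
* two official kdensity plots overlaid, both evaluated over 0 to 18000
twoway (kdensity price if !foreign, range(0 18000)) (kdensity price if foreign, range(0 18000)), legend(order(1 "Domestic" 2 "Foreign")) ytitle(Density) xtitle("Price (USD)")
Each kdensity plot picks its own default bandwidth for its subsample, and range() pins both curves to the same interval.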
Now let's clear the result variables out of the way and try something different.
A standard issue is that the results of kernel density estimation depend on the bandwidth, with a Goldilocks-like choice: it should be neither too small nor too big, where congenial size depends on taste and circumstance.
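One way to get a reference point for "too small" and "too big" is to ask kdensity which bandwidth it would pick by default and then try values on either side of it. A minimal sketch, assuming the auto data are still in memory:
Code:
* nograph suppresses the plot; the bandwidth used is left behind in r(bwidth)
kdensity price, kernel(biweight) nograph
display "default bandwidth: " r(bwidth)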
I experimented earlier and identified some possible bandwidths, here with a biweight kernel, for which I have an unreasonable affection. The gen call will produce four density variables: as they are all for the same price variable, we won't want them to be labelled identically, but rather want the variable labels to show the bandwidths. Then again we can choose a graph that is superimposed or -- as one of several alternatives -- one that is what I call "by style", because under the hood the data are temporarily restructured to allow a by() call.
Code:
multidensity clear
multidensity gen price, kernel(biweight) bw(400 600 800 1000) labelwith(bwidth)
multidensity super, title(Price (USD)) opt1(lp(dash)) opt3(lp(dash)) xla(4000(4000)16000) name(G4, replace)
multidensity bystyle, byopts(title(Price (USD)) note("biweight kernels, different bandwidth")) name(G5, replace)
For close comparisons of broadly similar curves, it is often better to have them in the same panel. For once, I think that the display in separate panels is clearer here, and clear enough to make the main point about the effects of varying bandwidth.
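The bandwidth comparison can also be sketched with official commands alone, which shows the bookkeeping that the wrapper takes over. This is an illustration under stated assumptions, not multidensity's actual code: the kd_ variable names are hypothetical, and the grid from 0 to 18000 is borrowed from the earlier example.
Code:
preserve
* common evaluation grid, one point per observation (hypothetical name kd_grid)
range kd_grid 0 18000
foreach bw in 400 600 800 1000 {
    * density at the grid points for this bandwidth, stored without drawing a graph
    kdensity price, kernel(biweight) bwidth(`bw') at(kd_grid) generate(kd_`bw') nograph
    label var kd_`bw' "bandwidth `bw'"
}
line kd_400 kd_600 kd_800 kd_1000 kd_grid, ytitle(Density) xtitle("Price (USD)")
restore
The preserve and restore pair leaves the dataset as it was, so the extra variables exist only long enough to draw the graph.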