Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • multidensity package available from SSC

    With repeated thanks to Kit Baum, a new package multidensity is available from SSC. It is billed as requiring Stata 8, but in truth I have not tested it against Stata 8 which has long since been inaccessible to me. Conversely, I will flag that some of the examples in the help won't work unless in Stata 15 up.

    The focus of this package -- just containing a single command with the same name -- is kernel density estimation, long since supported by official commands, but its selling point is as a convenience wrapper. The help file gives details as usual, so the purpose of this post is to give examples, so that you can quickly see whether the command might be interesting or useful to you.

    To be different, let's look at the auto data and first focus on the variable
    price and make up for StataCorp's omission of units of measurement. (The spirits of my high school science teachers would all be in anguish otherwise.)

    Code:
     set scheme s1color
    
     sysuse auto, clear
     label var price "Price (USD)"


    multidensity uses subcommands, the most useful being generate (which as usual can abbreviated all the way down to g, except that most users seem to find gen most congenial).

    In this example, we ask for density to be estimated over the range 0 to 18000 (USD) separately by
    foreign and using the default kernel (Epanechnikov) and using whatever default bandwidths kdensity chooses. That leaves four variables in memory, two density variables and two defining grids for the variable(s) in question. Given those extra variables, we can ask for graphs. First we ask for superimposed graphs, all the results in one panel. The three examples range from plain to fancy to fancier, plain lines, recast areas (transparency here requiring Stata 15 up) and then with some more work rugs on the bottom showing the distinct values in each cases (not "unique", please).

    Code:
     
     multidensity gen price, by(foreign) min(0) max(18000)
     multidensity super, name(G1, replace)
     multidensity super, recast(area) opt1(lcolor(orange) color(orange%40)) opt2(lcolor(blue) color(blue%40)) title("Price (USD)") name(G2, replace)
     su _density1, meanonly
     local max = r(max)
     su _density2, meanonly
     local max = max(`max', r(max))
     gen where1 = -`max'/15
     gen where0 = -`max'/30
     local rugcode addplot(scatter where0 price if !foreign, ms(|) mc(orange) || scatter where1 price if foreign, ms(|) mc(blue))
     multidensity super, recast(area) opt1(lcolor(orange) color(orange%40)) opt2(lcolor(blue) color(blue%40)) title("Price (USD)") ytitle(Density) `rugcode' name(G3, replace)



    Click image for larger version

Name:	md_G1.png
Views:	1
Size:	31.2 KB
ID:	1562663


    Click image for larger version

Name:	md_G2.png
Views:	1
Size:	27.6 KB
ID:	1562664



    Click image for larger version

Name:	md_G3.png
Views:	1
Size:	29.1 KB
ID:	1562665

    Now let's clear the result variables out of the way and try something different.

    A standard issue is that the results of kernel density estimation depend on the bandwidth, with a Goldilocks-like choice: it should be neither too small nor too big, where congenial size depends on taste and circumstance.

    I experimented earlier and identified some possible bandwidths, here with a biweight kernel, for which I have an unreasonable affection. The gen call will produce four density variables: as they are all for the same price variable, we won't want them to be labelled identically, but rather want the variable labels to show the bandwidths. Then again we can choose a graph that is superimposed or -- as one of more alternatives -- one that is what I call "by style", because under the hood the data are temporarily restructured to allow a by() call.

    Code:
     multidensity clear
     multidensity gen price, kernel(biweight) bw(400 600 800 1000) labelwith(bwidth)
     multidensity super, title(Price (USD)) opt1(lp(dash)) opt3(lp(dash)) xla(4000(4000)16000) name(G4, replace)
     multidensity bystyle, byopts(title(Price (USD)) note("biweight kernels, different bandwidth")) name(G5,  replace)
    


    Click image for larger version

Name:	md_G4.png
Views:	1
Size:	40.5 KB
ID:	1562666


    Click image for larger version

Name:	md_G5.png
Views:	1
Size:	34.9 KB
ID:	1562667

    For close comparisons of broadly similar curves, it is often better to have them in the same panel. For once, I think that the display in separate panels is clearer here, and clear enough to make the main point about the effects of varying bandwidth.




  • #2
    This is a continuation of the previous post, given a limit on attachments per post.

    You could have done something very similar to everything so far with just a little more work and kdensity or twoway kdensity. .

    In contrast, here is something less common. I've been trying to push the idea, which certainly isn't mine, in Stata circles for over 20 years, as shown by mdensity (on SSC since 1999, and superseded well and truly by this beast) and tkdensity (on SSC since 2012; also superseded by this), but without obvious sign of influencing anyone else.

    The idea is to transform, estimate density on a transformed scale, and then transform back. It is common, and characteristic, that positively skewed distributions are more ragged in the longer tail,as indeed can be seen in most displays above, which leads to an idea that they should be smoothed relatively more over higher values -- and relatively less over lower values. We do precisely that by working on a transformed scale. Let's fire up some possible transformations. As before, the variable labels for the results should indicate the transformations used.

    Code:
     multidensity clear
     multidensity gen price, trans(identity root cube_root log) labelwith(trans)
     multidensity bystyle, byopts(title(Price (USD)) note("transform, estimate and back-transform")) name(G6, replace)
    Click image for larger version

Name:	md_G6.png
Views:	1
Size:	34.6 KB
ID:	1562669


    It does not seem to matter too much which transformation we use. They all tidy up the right tail a little.

    Finally, let's look at a basic descriptive or exploratory display of some key variables. As is common they have very different units of measurement and different magnitudes, so "by style" is also possible, as is a style I call juxtaposed, meaning simply separate plots put together with graph combine.

    [/CODE]
    multidensity clear
    multidensity gen price weight mpg length
    multidensity juxta, combineopts(name(G7, replace))
    multidensity bystyle, name(G8, replace)
    [/CODE]

    Click image for larger version

Name:	md_G7.png
Views:	1
Size:	34.2 KB
ID:	1562670

    Click image for larger version

Name:	md_G8.png
Views:	1
Size:	33.6 KB
ID:	1562671

    Last edited by Nick Cox; 09 Jul 2020, 16:29.

    Comment


    • #3
      Nick, thanks for this. I'm doing a lot of kernel density work at the moment and this is a very welcome addition to SSC. Do you know if there are any Stata commands around to calculate suggested bandwidths, eg Sheather-Jones? If not it would be a valuable addition to multidensity...

      Comment


      • #4
        Robert:

        Automating bandwidth selection is a genuine problem for many people with lots of curves to process and an intriguing problem mixing theory and computing for those so inclined. I've always been a bit sceptical myself but -- more importantly for your question -- I've not ever tried to implement it myself and have no plans in that respect. I can't speak for any of the other community-contributed commands.

        More mundane, but personally worrying: not long after posting this I decided that multidensity is a poor name, as too likely to seem to imply multivariate density estimation, and if I think of a better one, I will change it.

        Comment


        • #5
          Why not do both multiple densities and multiple variables?? Simple enough with zero correlations. Not sure how generically one could provide a program for copula kernels. Bung in Krige's gaussian process BLUP while you're at it (although my money would be on that appearing in [SP] Stata 17).

          (Suggesting work for other people to do is so easy! I know you have a lot on.)

          Comment

          Working...
          X