Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • -cisets- downloadable from SSC: confidence interval sets

    Thanks as always to Kit Baum, a new command cisets is now downloadable from SSC. Stata 14.1 is required.

    Here is the executive summary (edited very slightly) from

    Code:
    ssc desc cisets
    to enable you to decide quickly whether you should care.

    cisets computes confidence interval sets for population means,
    proportions, variances, standard deviations, geometric means,
    harmonic means, and centiles. It is a wrapper for variously ci
    (for means, proportions, variances and standard deviations);
    ameans (for geometric and harmonic means); and centile. All
    syntaxes include cisets followed by a subcommand which specifies
    a particular summary measure. There are two flavours of commands
    otherwise, for variables and for groups of observations.
    Confidence interval sets are temporary datasets that may be saved
    as permanent datasets and in turn combined with other sets using
    append or merge. That allows graphical and other applications.

    If you're still here, skim or skip through the rest of this post while being on the lookout for incidental or accidental humour.

    The primary intent of cisets is to prepare for plotting of confidence intervals for standard descriptive measures. It does no plotting itself. That deserves a small story.

    Plotting confidence intervals for say means is a long-standing and standard problem for which I've posted various commands. The command of mine of most interest under this heading is ciplot from SSC, which modulo one small change has not been altered since 2003. That itself is not fatal to its usefulness, but while it is still posted for anyone who wants it, I've been trying to discourage its use on various grounds.

    1. Vince Wiggins, for several years a leading developer at StataCorp, made a cogent throwaway remark that in many confidence interval plotting problems it is natural to call up statsby to do the hard work. I promptly borrowed or stole that idea and wrote it up in https://journals.sagepub.com/doi/pdf...867X1001000112 So, just use statsby is good advice often. The graphics is then usually a matter of standard twoway commands and personal style.

    2. ciplot is a wrapper for ci as was but ci was updated in Stata 14.1 (which is a hint about the restriction in the first paragraph). Most of what people might want can be done using part of the old syntax, but that's at best a little awkward.

    3. In many, perhaps most, projects confidence intervals should be plotted alongside the data -- or at least commands should allow that to be done easily. The worst offenders here are often so-called dynamite plots (detonator plots, plunger plots) which not only suppress the data but present estimates and confidence intervals in distorted form -- by showing estimates as fat bars starting at zero and confidence intervals (or other error bars) in fainter form or even with the lower part omitted. Frequently comparison with zero is not of statistical or scientific interest, so the emphasis is irrelevant or misleading.

    That said, I started again with an idea of writing a kind of replacement for ciplot that did much more. My aim is emphatically not to try to undermine established commands such as marginsplot which is both more versatile and more complicated. I write something I want to use, primarily. The ideal of a large improvement proved elusive and illusive as tackling the problems I had in mind produced a mass of tricky code, while more thought made obvious that there were natural problems in comparing confidence intervals graphically that were beyond my design. In short, it was too ambitious and too limited at the same time.

    So, change the problem if you can't find a solution. cisets is a wrapper for ci and also for ameans (primarily because I am often interested in geometric means) and centile (primarily because I am often interested in medians). The ideal, and the idea, is only to calculate the confidence intervals as new data -- that is, as new variables. By default, you only get a listing, which may be as much as you want, but more enterprising applications will require saving the dataset to a new file, and quite possibly a later append or merge with other confidence interval sets, or with your original data. Otherwise put, the intent is nothing to do with the results of fitted models, except insofar as descriptive statistics imply very simple models.

    To give a little more flavour, here is what happens when you ask for confidence intervals for geometric means for mpg from the auto data:

    Code:
    . cisets gmean mpg, over(rep78) saving(foo, replace)
    
      +-------------------------------------------------------------------------------------------------------------------------------------------+
      | varname        varlabel   origgvar   groupvar            gvarlabel   group    n         statname      point         lb         ub   level |
      |-------------------------------------------------------------------------------------------------------------------------------------------|
      |     mpg   Mileage (mpg)          1      rep78   Repair record 1978       1    2   geometric mean   20.78461   3.341899   129.2678      95 |
      |     mpg   Mileage (mpg)          2      rep78   Repair record 1978       2    8   geometric mean   18.80446   15.95322   22.16529      95 |
      |     mpg   Mileage (mpg)          3      rep78   Repair record 1978       3   30   geometric mean   19.00496   17.52744   20.60703      95 |
      |     mpg   Mileage (mpg)          4      rep78   Repair record 1978       4   18   geometric mean   21.10463   18.72538   23.78618      95 |
      |     mpg   Mileage (mpg)          5      rep78   Repair record 1978       5   11   geometric mean   26.03027   20.74643   32.65983      95 |
      +-------------------------------------------------------------------------------------------------------------------------------------------+
    
    file foo.dta saved
    
    . d using foo 
    
    Contains data                                 
     Observations:             5                  29 Oct 2024 18:25
        Variables:            12                  
    --------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Variable      Storage   Display    Value
        name         type    format    label      Variable label
    --------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    varname         str3    %9s                   
    varlabel        str13   %13s                  
    origgvar        int     %8.0g                 Repair record 1978
    groupvar        str5    %9s                   
    gvarlabel       str18   %18s                  
    group           float   %9.0g      group      Repair record 1978
    n               float   %9.0g                 
    statname        str14   %14s                  
    point           float   %9.0g                 
    lb              float   %9.0g                 
    ub              float   %9.0g                 
    level           float   %9.0g                 
    --------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Sorted by:
    At first sight the dataset may seem ridiculously redundant, but the larger point is that you can combine this with other such datasets, for different variables, different breakdowns, different summary measures (parameters), different levels, even in some cases different methods of calculating confidence intervals.

    And from the dataset, one-off or as combined, you can then plot intervals according to taste. Here I use range bars for the intervals for medians, as inspired by Leland Wilkinson, as an alternative to the more conventional uncapped or capped spikes.

    Click image for larger version

Name:	CI4.png
Views:	1
Size:	36.0 KB
ID:	1766638



    Here is the code for that, interspersed with results:

    Code:
    . sysuse auto, clear 
    . cisets centile mpg, over(rep78) total saving(foo, replace)
    
      +-----------------------------------------------------------------------------------------------------------------------------------+
      | varname        varlabel   origgvar   groupvar            gvarlabel   group    n    statname   point         lb         ub   level |
      |-----------------------------------------------------------------------------------------------------------------------------------|
      |     mpg   Mileage (mpg)          1      rep78   Repair record 1978       1    2   50 pctile      21         18         24      95 |
      |     mpg   Mileage (mpg)          2      rep78   Repair record 1978       2    8   50 pctile      18      15.35         24      95 |
      |     mpg   Mileage (mpg)          3      rep78   Repair record 1978       3   30   50 pctile      19   18.12912   20.87088      95 |
      |     mpg   Mileage (mpg)          4      rep78   Repair record 1978       4   18   50 pctile    22.5         18         25      95 |
      |     mpg   Mileage (mpg)          5      rep78   Repair record 1978       5   11   50 pctile      30   17.71273         35      95 |
      |-----------------------------------------------------------------------------------------------------------------------------------|
      |     mpg   Mileage (mpg)          .      rep78   Repair record 1978   Total   69   50 pctile      20   18.86177         22      95 |
      +-----------------------------------------------------------------------------------------------------------------------------------+
    
    file foo.dta saved
    
    . u foo, clear
    
    . replace statname = "median"
    (6 real changes made)
    
    . su ub, meanonly
    
    . local ubmax = r(max)
    
    . su lb, meanonly 
    
    . local lbmin = r(min)
    
    . gen where = `lbmin' - (`ubmax' - `lbmin') / 10 
    
    . gen show_n = ("{it:n} = ") + strofreal(n)
    
    . twoway rbar ub lb group, lcolor(stc1) fcolor(none) barw(0.2) ///
    || scatter where group, ms(none) mla(show_n) mlabpos(0) mlabc(black) mlabsize(medium) ///
    || scatter point group, pstyle(p1) msize(medlarge)  ms(D) xtitle("`=gvarlabel'")   ///
    xla(1/6, valuelabel) xsc(r(0.5, 6.5)) legend(off) ytitle("`=varlabel'") subtitle("`=statname's: `=level'% confidence intervals", place(w))
    Other examples do show the intervals alongside the original data.

    Anyone interested -- except those behind fearsome firewalls -- can naturally download the code and look at the help file, noting that an ancillary file contains all the code for the examples.

    A sequel command for percentile sets is already done and will perhaps be posted next week.

  • #2
    As hinted above, you can use cisets in conjunction with other commands -- say scatter or histogram -- to show confidence intervals together with the original data. This particular example uses qplot from the Stata Journal. So I show geometric means and price data on logarithmic scale, a point that goes back to Galileo, although I have not seen it mentioned in any history of econometrics.

    Code:
    set scheme stcolor
    
    sysuse auto, clear
    
    cisets gmean price, over(foreign) saving(foo, replace)
    
    clonevar origgvar=foreign
    
    merge m:1 origgvar using foo
    
    gen where = 1
    
    qplot price, ms(O) by(foreign, legend(off) note(95% confidence intervals for geometric means)) xla(0 0.25 "0.25" 0.5 "0.5" 0.75 "0.75" 1) addplot(rbar ub lb where, barw(0.08) fcolor(none) pstyle(p2) || scatter point where, ms(Dh) msize(medlarge) pstyle(p2)) ysc(log) yla(3000(2000)15000) ytitle(Price (USD)) xtitle(Fraction of data)
    Click image for larger version

Name:	cisets_gmean.png
Views:	1
Size:	36.5 KB
ID:	1766670





    So, the sales pitch is this. If any graphical command supports an addplot() option it is a wide-open door to add code to show confidence intervals using standard twoway commands. It's a small matter of programming to merge back the confidence interval data set and to define a variable specifying where the display should go. (If you wanted the interval on the left, its horizontal position could be zero or a small negative number. The x axis labels do not lie: that scale really is from 0 to 1.)

    In contrast to some other over-sold designs -- box plots, violin plots, and some others -- quantile plots are honest plots, showing gaps, spikes, outliers, skewness and what not without omission or distortion -- not to mention signals about (sub)-sample size.
    Last edited by Nick Cox; 30 Oct 2024, 07:59.

    Comment


    • #3
      Thanks again to Kit Baum, small tweaks have been made to the files on SSC.

      Comment

      Working...
      X