-cisets- downloadable from SSC: confidence interval sets

Nick Cox

Join Date: Mar 2014

Posts: 35357
#1

-cisets- downloadable from SSC: confidence interval sets

29 Oct 2024, 12:46

Thanks as always to Kit Baum, a new command cisets is now downloadable from SSC. Stata 14.1 is required.

Here is the executive summary (edited very slightly) from

Code:

ssc desc cisets

to enable you to decide quickly whether you should care.

cisets computes confidence interval sets for population means,
proportions, variances, standard deviations, geometric means,
harmonic means, and centiles. It is a wrapper for variously ci
(for means, proportions, variances and standard deviations);
ameans (for geometric and harmonic means); and centile. All
syntaxes include cisets followed by a subcommand which specifies
a particular summary measure. There are two flavours of commands
otherwise, for variables and for groups of observations.
Confidence interval sets are temporary datasets that may be saved
as permanent datasets and in turn combined with other sets using
append or merge. That allows graphical and other applications.

If you're still here, skim or skip through the rest of this post while being on the lookout for incidental or accidental humour.

The primary intent of cisets is to prepare for plotting of confidence intervals for standard descriptive measures. It does no plotting itself. That deserves a small story.

Plotting confidence intervals for say means is a long-standing and standard problem for which I've posted various commands. The command of mine of most interest under this heading is ciplot from SSC, which modulo one small change has not been altered since 2003. That itself is not fatal to its usefulness, but while it is still posted for anyone who wants it, I've been trying to discourage its use on various grounds.

1. Vince Wiggins, for several years a leading developer at StataCorp, made a cogent throwaway remark that in many confidence interval plotting problems it is natural to call up statsby to do the hard work. I promptly borrowed or stole that idea and wrote it up in https://journals.sagepub.com/doi/pdf...867X1001000112 So, just use statsby is good advice often. The graphics is then usually a matter of standard twoway commands and personal style.

2. ciplot is a wrapper for ci as was but ci was updated in Stata 14.1 (which is a hint about the restriction in the first paragraph). Most of what people might want can be done using part of the old syntax, but that's at best a little awkward.

3. In many, perhaps most, projects confidence intervals should be plotted alongside the data -- or at least commands should allow that to be done easily. The worst offenders here are often so-called dynamite plots (detonator plots, plunger plots) which not only suppress the data but present estimates and confidence intervals in distorted form -- by showing estimates as fat bars starting at zero and confidence intervals (or other error bars) in fainter form or even with the lower part omitted. Frequently comparison with zero is not of statistical or scientific interest, so the emphasis is irrelevant or misleading.

That said, I started again with an idea of writing a kind of replacement for ciplot that did much more. My aim is emphatically not to try to undermine established commands such as marginsplot which is both more versatile and more complicated. I write something I want to use, primarily. The ideal of a large improvement proved elusive and illusive as tackling the problems I had in mind produced a mass of tricky code, while more thought made obvious that there were natural problems in comparing confidence intervals graphically that were beyond my design. In short, it was too ambitious and too limited at the same time.

So, change the problem if you can't find a solution. cisets is a wrapper for ci and also for ameans (primarily because I am often interested in geometric means) and centile (primarily because I am often interested in medians). The ideal, and the idea, is only to calculate the confidence intervals as new data -- that is, as new variables. By default, you only get a listing, which may be as much as you want, but more enterprising applications will require saving the dataset to a new file, and quite possibly a later append or merge with other confidence interval sets, or with your original data. Otherwise put, the intent is nothing to do with the results of fitted models, except insofar as descriptive statistics imply very simple models.

To give a little more flavour, here is what happens when you ask for confidence intervals for geometric means for mpg from the auto data:

Code:

. cisets gmean mpg, over(rep78) saving(foo, replace) +-------------------------------------------------------------------------------------------------------------------------------------------+ | varname varlabel origgvar groupvar gvarlabel group n statname point lb ub level | |-------------------------------------------------------------------------------------------------------------------------------------------| | mpg Mileage (mpg) 1 rep78 Repair record 1978 1 2 geometric mean 20.78461 3.341899 129.2678 95 | | mpg Mileage (mpg) 2 rep78 Repair record 1978 2 8 geometric mean 18.80446 15.95322 22.16529 95 | | mpg Mileage (mpg) 3 rep78 Repair record 1978 3 30 geometric mean 19.00496 17.52744 20.60703 95 | | mpg Mileage (mpg) 4 rep78 Repair record 1978 4 18 geometric mean 21.10463 18.72538 23.78618 95 | | mpg Mileage (mpg) 5 rep78 Repair record 1978 5 11 geometric mean 26.03027 20.74643 32.65983 95 | +-------------------------------------------------------------------------------------------------------------------------------------------+ file foo.dta saved . d using foo Contains data Observations: 5 29 Oct 2024 18:25 Variables: 12 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Variable Storage Display Value name type format label Variable label -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- varname str3 %9s varlabel str13 %13s origgvar int %8.0g Repair record 1978 groupvar str5 %9s gvarlabel str18 %18s group float %9.0g group Repair record 1978 n float %9.0g statname str14 %14s point float %9.0g lb float %9.0g ub float %9.0g level float %9.0g -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Sorted by:

At first sight the dataset may seem ridiculously redundant, but the larger point is that you can combine this with other such datasets, for different variables, different breakdowns, different summary measures (parameters), different levels, even in some cases different methods of calculating confidence intervals.

And from the dataset, one-off or as combined, you can then plot intervals according to taste. Here I use range bars for the intervals for medians, as inspired by Leland Wilkinson, as an alternative to the more conventional uncapped or capped spikes.

Here is the code for that, interspersed with results:

Code:

. sysuse auto, clear . cisets centile mpg, over(rep78) total saving(foo, replace) +-----------------------------------------------------------------------------------------------------------------------------------+ | varname varlabel origgvar groupvar gvarlabel group n statname point lb ub level | |-----------------------------------------------------------------------------------------------------------------------------------| | mpg Mileage (mpg) 1 rep78 Repair record 1978 1 2 50 pctile 21 18 24 95 | | mpg Mileage (mpg) 2 rep78 Repair record 1978 2 8 50 pctile 18 15.35 24 95 | | mpg Mileage (mpg) 3 rep78 Repair record 1978 3 30 50 pctile 19 18.12912 20.87088 95 | | mpg Mileage (mpg) 4 rep78 Repair record 1978 4 18 50 pctile 22.5 18 25 95 | | mpg Mileage (mpg) 5 rep78 Repair record 1978 5 11 50 pctile 30 17.71273 35 95 | |-----------------------------------------------------------------------------------------------------------------------------------| | mpg Mileage (mpg) . rep78 Repair record 1978 Total 69 50 pctile 20 18.86177 22 95 | +-----------------------------------------------------------------------------------------------------------------------------------+ file foo.dta saved . u foo, clear . replace statname = "median" (6 real changes made) . su ub, meanonly . local ubmax = r(max) . su lb, meanonly . local lbmin = r(min) . gen where = `lbmin' - (`ubmax' - `lbmin') / 10 . gen show_n = ("{it:n} = ") + strofreal(n) . twoway rbar ub lb group, lcolor(stc1) fcolor(none) barw(0.2) /// || scatter where group, ms(none) mla(show_n) mlabpos(0) mlabc(black) mlabsize(medium) /// || scatter point group, pstyle(p1) msize(medlarge) ms(D) xtitle("`=gvarlabel'") /// xla(1/6, valuelabel) xsc(r(0.5, 6.5)) legend(off) ytitle("`=varlabel'") subtitle("`=statname's: `=level'% confidence intervals", place(w))

Other examples do show the intervals alongside the original data.

Anyone interested -- except those behind fearsome firewalls -- can naturally download the code and look at the help file, noting that an ancillary file contains all the code for the examples.

A sequel command for percentile sets is already done and will perhaps be posted next week.
Tags: None

2 likes
Nick Cox

Join Date: Mar 2014

Posts: 35357
#2

30 Oct 2024, 06:56

As hinted above, you can use cisets in conjunction with other commands -- say scatter or histogram -- to show confidence intervals together with the original data. This particular example uses qplot from the Stata Journal. So I show geometric means and price data on logarithmic scale, a point that goes back to Galileo, although I have not seen it mentioned in any history of econometrics.

Code:

set scheme stcolor sysuse auto, clear cisets gmean price, over(foreign) saving(foo, replace) clonevar origgvar=foreign merge m:1 origgvar using foo gen where = 1 qplot price, ms(O) by(foreign, legend(off) note(95% confidence intervals for geometric means)) xla(0 0.25 "0.25" 0.5 "0.5" 0.75 "0.75" 1) addplot(rbar ub lb where, barw(0.08) fcolor(none) pstyle(p2) || scatter point where, ms(Dh) msize(medlarge) pstyle(p2)) ysc(log) yla(3000(2000)15000) ytitle(Price (USD)) xtitle(Fraction of data)

So, the sales pitch is this. If any graphical command supports an addplot() option it is a wide-open door to add code to show confidence intervals using standard twoway commands. It's a small matter of programming to merge back the confidence interval data set and to define a variable specifying where the display should go. (If you wanted the interval on the left, its horizontal position could be zero or a small negative number. The x axis labels do not lie: that scale really is from 0 to 1.)

In contrast to some other over-sold designs -- box plots, violin plots, and some others -- quantile plots are honest plots, showing gaps, spikes, outliers, skewness and what not without omission or distortion -- not to mention signals about (sub)-sample size.

Last edited by Nick Cox; 30 Oct 2024, 06:59.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35357
#3

05 Nov 2024, 11:16

Thanks again to Kit Baum, small tweaks have been made to the files on SSC.
Comment

Announcement

-cisets- downloadable from SSC: confidence interval sets

Comment

Comment