Thanks as always to Kit Baum, the package tabplot on SSC has been
updated with new ado and help files for that program, which goes back to
1999. Stata 8 is required. tabplot is billed as supporting one-, two-
and three-way bar charts for tables, which understates its possibilities
a little, but the whole story need not be given here.
"Multiple bar charts" would be a good umbrella term, except for the need
to explain that it doesn't mean stacked or divided bars and it doesn't mean
bars side by side on the same axis (and except for the puzzle that a
single bar would just get lonely, so don't all bar charts have multiple
bars?). (A single bar does not mean a "singles bar".)
The update in code fixes some awkward, indeed deficient, parsing of
calls to the by() option, which ruled out adjustment of a note() call
together with the by() option.
A bigger deal by comparison is much re-writing of the help file, with
restructured explanation of syntax, better-explained and more numerous
examples, and many more references since the last update several months
ago.
If interested, then use
Code:
ssc inst tabplot
to install afresh or
Code:
ssc inst tabplot, replace
to update an existing installation; some readers may be using
Code:
adoupdate
instead.
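If you are not sure what is already installed, a quick check along these lines may help. This is only a sketch using official Stata commands; naming the package after adoupdate and adding its update option are my additions to the commands above, not something the update itself requires.

```stata
* Is tabplot installed, and which version? (capture noisily: no error stop if it is absent)
capture noisily which tabplot

* Describe the package on SSC without installing anything
ssc describe tabplot

* Report whether just this package is out of date ...
adoupdate tabplot
* ... and actually update it if so
adoupdate tabplot, update
```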
Bar charts are basic, and may seem very well supported in Stata, as only
a little acquaintance with the documentation reveals four commands,
graph bar, graph hbar, twoway bar and twoway rbar, which might seem
already three more than one might need.
Another command for bar charts (or more; I have others) thus needs a
little explanation. This one is itself just a wrapper for twoway rbar,
but it can do various plots more easily than you could do yourself,
unless you were willing to do a little programming and a lot of fiddling
around.
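To give a flavour of that programming and fiddling, here is a minimal hand-rolled sketch of a table-like bar chart with twoway rbar. It is not tabplot's own code, just the general idea, and it uses only two age groups from the health data given later in this post. The variable names total, pct, base, top and column and the scaling factor 0.8 are arbitrary choices for illustration.

```stata
* Sketch only: a table-like bar chart done by hand with twoway rbar
clear
input byte(agegroup health) long freq
1 1 243
1 2 789
1 3 167
1 4 18
1 5 6
7 1 20
7 2 136
7 3 157
7 4 66
7 5 17
end

* percent of each age group in each health category
egen total = total(freq), by(agegroup)
generate pct = 100 * freq / total

* each bar starts at its row position and rises in proportion to pct;
* the tallest bar is scaled to 0.8 of the gap between rows
summarize pct, meanonly
generate base = 6 - health            // "very good" ends up in the top row
generate top = base + 0.8 * pct / r(max)
egen column = group(agegroup)         // columns 1 and 2

twoway rbar base top column, barwidth(0.5)                              ///
    ylabel(5 "very good" 4 "good" 3 "regular" 2 "bad" 1 "very bad",     ///
        angle(horizontal) noticks)                                      ///
    xlabel(1 "16-24" 2 "75+") xscale(range(0.5 2.5))                    ///
    ytitle("") xtitle("") subtitle("% of age group")
```

Even this small case needs row positions, scaling and axis labels worked out by hand, and it does not yet show any values next to the bars.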
The main conceit of tabplot is table-like plots. The name is intended to
evoke commands like tabulate with their structured output of tables in
rows and columns.
Incidentally, I note that there is a tabplot package for R with its main
command tableplot; an old Stata command of mine called tableplot also
exists on SSC, but its main capabilities have long since been folded
into tabplot. I don't doubt that tabplot on R is good, but I've never
used it or studied its documentation closely. I am pretty sure that I
used the name first, not that I mind so long as the name remains
distinct within Stata.
Clearly the help file is there with the details you are expected to
want, so the best I can now do for anyone curious is to give a couple of
self-contained examples, together with a moderate sales pitch.
Other applications of tabplot can be found at
http://www.statalist.org/forums/foru...-and-subgraphs
http://www.statalist.org/forums/foru...something-else
http://www.statalist.org/forums/foru...d-with-grc1leg
http://www.statalist.org/forums/foru...lot-or-tabplot
http://stats.stackexchange.com/quest...inal-variables
http://stats.stackexchange.com/quest...ical-variables
Greenacre (2007, p.42; full reference below) gave these data from the
Encuesta Nacional de la Salud (Spanish National Health Survey), 1997.
They are interesting in themselves, but for my purposes they are useful
as an example large enough to be challenging. As with many tables, the
main handle for understanding is to look at the probability distribution
of the response health given the predictor age. tabplot offers options
to calculate percent or proportional/fractional breakdowns on the fly.
Aesthetic preferences or conventions often encourage presentation in
terms of percents. ("Percentage" seems to me too long a word, whatever
dictionaries may say.)
Code:
clear
input byte(agegroup health) long freq
1 1 243
1 2 789
1 3 167
1 4 18
1 5 6
2 1 220
2 2 809
2 3 164
2 4 35
2 5 6
3 1 147
3 2 658
3 3 181
3 4 41
3 5 8
4 1 90
4 2 469
4 3 236
4 4 50
4 5 16
5 1 53
5 2 414
5 3 306
5 4 106
5 5 30
6 1 44
6 2 267
6 3 284
6 4 98
6 5 20
7 1 20
7 2 136
7 3 157
7 4 66
7 5 17
end
label values agegroup agegroup
label def agegroup 1 "16-24", modify
label def agegroup 2 "25-34", modify
label def agegroup 3 "35-44", modify
label def agegroup 4 "45-54", modify
label def agegroup 5 "55-64", modify
label def agegroup 6 "65-74", modify
label def agegroup 7 "75+", modify
label values health health
label def health 1 "very good", modify
label def health 2 "good", modify
label def health 3 "regular", modify
label def health 4 "bad", modify
label def health 5 "very bad", modify
tabplot health agegroup [w=freq] , percent(agegroup) showval subtitle(% of age group) xtitle("") bfcolor(none)
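The percents requested by percent(agegroup) are conditional on age group, so they should agree with the column percents of a conventional two-way table. A short cross-check with official tabulate is sketched below, using just two of the age groups to keep it brief. Using fweights here is my assumption about how the frequencies should be treated.

```stata
* Cross-check: distribution of health within age group, as a table
clear
input byte(agegroup health) long freq
1 1 243
1 2 789
1 3 167
1 4 18
1 5 6
5 1 53
5 2 414
5 3 306
5 4 106
5 5 30
end
label define agegroup 1 "16-24" 5 "55-64"
label define health 1 "very good" 2 "good" 3 "regular" 4 "bad" 5 "very bad"
label values agegroup agegroup
label values health health

* column percents only: each column (age group) adds to 100
tabulate health agegroup [fweight=freq], column nofreq
```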
What particularly bites here are some very small percents, which are
perfectly credible and not at all unusual for such data. A merit of the
multiple bar charts design is that small values are discernible as such.
Note especially the showval option, which insists on showing values too.
The graph thus deliberately uses table ideas and graph ideas together.
Sometimes people say to me, "But you shouldn't do that!" and some
prohibition emerges that graphs are graphs and tables are tables, and
ne'er the twain shall meet, which seems to me no more than superstition.
Digression. An intriguing suggestion, which I have borrowed elsewhere,
is that the conventional distinction between graphs and tables was a
side-effect of the development of printing. Before printing there were
manuscripts -- those scripted manually, or written by hand -- to which
writers could add illustrations, say of knights, or dragons, or of
sinners being tormented, or something equally entertaining, as they
liked and where they liked. Printed documents encouraged, or even
enforced, a division of labour between typesetters and those who
prepared illustrations. But now that's obsolete.
A detailed objection to numeric values too is that they clutter up the
graph, to which the answers are that it depends on how you do it, and that
if you strongly object it's not compulsory. But tabplot gives up on
labelling axes with bar magnitudes, so that reduces clutter too.
Given this dataset, how else would you represent the patterns
graphically? Setting aside any temptation to draw multiple pie charts,
one alternative is a stacked bar chart:
Code:
* ssc inst catplot needed before
catplot health agegroup [w=freq], percent(agegroup) asyvars stack subtitle(% of age group)
In recent Stata versions, graph hbar could also do this directly, but the syntax
differs.
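For anyone who would rather stay with official commands, a rough graph hbar equivalent is sketched below, again with only two age groups to keep it short. The combination of asyvars, stack and percentages is one way to get percents within each age group; it is my suggestion, not a translation of what catplot does internally.

```stata
* Sketch: stacked percent bars with official graph hbar (no catplot needed)
clear
input byte(agegroup health) long freq
1 1 243
1 2 789
1 3 167
1 4 18
1 5 6
7 1 20
7 2 136
7 3 157
7 4 66
7 5 17
end
label define agegroup 1 "16-24" 7 "75+"
label define health 1 "very good" 2 "good" 3 "regular" 4 "bad" 5 "very bad"
label values agegroup agegroup
label values health health

* the first over() becomes the stacked segments; the second defines the bars;
* percentages rescales each bar to add to 100
graph hbar (sum) freq, over(health) over(agegroup) asyvars stack percentages ///
    ytitle("% of age group")
```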
I have not tried too hard to optimise this: the colour scheme and legend both need work,
and so forth. Some would prefer vertical bars here.
The key point is whether it could be made better (clearer, more effective,
more attractive) than the previous graph. I note three key issues:
1. Stacking is a well-understood design but very small amounts are hard
to discern.
2. A legend necessarily springs into being, but a legend obliges mental "back
and forth" from readers (or else readers give up on looking at the detail).
3. The program would let you add numeric values on top of the bars, but that would
be at least a little messy.
Naturally this is a straw graph that I set up to knock down again, but are there good
alternatives? I've had better results with unstacked bars for this example, but I
will move on.
Let's look at graphs for a three-way table.
Aitkin et al. (1989, p.242; full reference below) reported data from a
survey of student opinion on the Vietnam War taken at the University of
North Carolina in Chapel Hill in May 1967. Students were classified by
sex, year of study, and the policy they supported, given choices of
A. The United States should defeat the power of North Vietnam by
widespread bombing of its industries, ports, and harbors and by land
invasion.
B. The United States should follow the present policy in Vietnam.
C. The United States should de-escalate its military activity, stop
bombing North Vietnam, and intensify its efforts to begin negotiation.
D. The United States should withdraw its military forces from Vietnam
immediately.
The labels A ... D are fairly dopey, but even at this distance
suggesting better ones might be thought contentious politically, so I
will desist.
Code:
clear
input str6 sex str8 year str1 policy int freq
"male" "1" "A" 175
"male" "1" "B" 116
"male" "1" "C" 131
"male" "1" "D" 17
"male" "2" "A" 160
"male" "2" "B" 126
"male" "2" "C" 135
"male" "2" "D" 21
"male" "3" "A" 132
"male" "3" "B" 120
"male" "3" "C" 154
"male" "3" "D" 29
"male" "4" "A" 145
"male" "4" "B" 95
"male" "4" "C" 185
"male" "4" "D" 44
"male" "Graduate" "A" 118
"male" "Graduate" "B" 176
"male" "Graduate" "C" 345
"male" "Graduate" "D" 141
"female" "1" "A" 13
"female" "1" "B" 19
"female" "1" "C" 40
"female" "1" "D" 5
"female" "2" "A" 5
"female" "2" "B" 9
"female" "2" "C" 33
"female" "2" "D" 3
"female" "3" "A" 22
"female" "3" "B" 29
"female" "3" "C" 110
"female" "3" "D" 6
"female" "4" "A" 12
"female" "4" "B" 21
"female" "4" "C" 58
"female" "4" "D" 10
"female" "Graduate" "A" 19
"female" "Graduate" "B" 27
"female" "Graduate" "C" 128
"female" "Graduate" "D" 13
end
tabplot policy year [w=freq], by(sex, subtitle(% by sex and year, place(w)) note("")) percent(sex year) showval
The way to plot three-way tables is unsurprisingly by using a by() option to repeat two-way tables.
The syntax for tabplot matches standard conventions such that (as in regress and scatter, for
example) it is usually best to mention the response or outcome variable first (as defining rows of
the plot, and as to be shown on the y axis). There can be trade-offs or compromises,
as no layout is best for all purposes, but big differences can safely be put at a distance (males
and females here differ markedly in their mix of views, so they go in separate panels), while finer distinctions are
easier to make if bars are close. On top of all that, any ordinal scales should naturally be
respected as such.
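To see the trade-off, the roles of sex and year can be swapped, so that male and female bars sit side by side within a panel for each year. The sketch below uses only first-year and graduate students to stay short, and sticks to options already used in this post; whether this layout reads better is a matter of judgment. Note that year is a string variable here, and its alphanumeric sort order happens to match the intended ordinal order.

```stata
* needs tabplot from SSC: ssc inst tabplot
clear
input str6 sex str8 year str1 policy int freq
"male"   "1"        "A" 175
"male"   "1"        "B" 116
"male"   "1"        "C" 131
"male"   "1"        "D" 17
"male"   "Graduate" "A" 118
"male"   "Graduate" "B" 176
"male"   "Graduate" "C" 345
"male"   "Graduate" "D" 141
"female" "1"        "A" 13
"female" "1"        "B" 19
"female" "1"        "C" 40
"female" "1"        "D" 5
"female" "Graduate" "A" 19
"female" "Graduate" "B" 27
"female" "Graduate" "C" 128
"female" "Graduate" "D" 13
end

* sexes adjacent within each year panel; percents still within sex and year
tabplot policy sex [w=freq], by(year, note("")) percent(sex year) showval
```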
Aitkin, M., D. Anderson, B. Francis, and J. Hinde. 1989. Statistical
Modelling in GLIM. Oxford: Oxford University Press.
Greenacre, M. 2007. Correspondence Analysis in Practice. Boca Raton, FL:
Chapman & Hall/CRC.