Thanks as always to Kit Baum, an update of groups is available on SSC.
groups is a basic command for listing group frequencies and percents. An early version was
described briefly in Stata Journal 3: 420--439 (2003)
The main underlying idea is that many tables are easily and helpfully presented as lists.
Specifically, the list command is a convenient engine for producing tables in list form.
That command underlies groups. The idea also deserves a good push for other applications
you may have.
More than that: the summary in the opening sentence understates what groups can do. You
can also suppress any display of frequencies and percents. That being so, groups can be
used for almost any table. You may need to calculate beforehand what you want to tabulate,
but that is generally true. All two-way and higher tabulations are reduced to one-way
listings. The argument is that this constraint on layout is not so much of a loss as at
first may appear.
Some simple examples will give the flavour.
First we look at one-way tables. The resemblance between the results of groups foreign and
of tabulate foreign will be clear:
When we turn to two-way tables, some key differences appear.
The leading idea for two-way tables is simple. Instead of a layout of rows by columns
defining cells inside the table, groups gives you a listing in which each item is
row identifier, column identifier, cell variable
The example just given was chosen to make a point. Even with a simple table of 2 rows and
5 columns, tabulate is wrapping output awkwardly. Trivially in this case, you could just
swap variables and have 5 rows and 2 columns. But it is easy to get tables that are
awkward both ways. The major problem is too many columns for comfort. (Too many rows for
comfort is awkward with any command.)
The problem of space is usually compounded with three-way and higher tables. Even if
there is enough space, the sparsity (many zeroes) of some tables often makes other kinds
of tabulation attractive. Broadly, it is with such tables with three or more variables
that groups does especially well. A list structure is clearly general enough to extend to
one or more identifiers, one or more cell variables
By default, groups will not list cross-combinations of two or more identifiers that are
not present in the data. Note how in the last groups output, there are no lines for cars
that are foreign with repair record 1 or 2: as is explicit in the tabulate output, there
are no such cars. Conversely, a fillin option is available:
.
The default for groups is that percents are calculated with reference to all observations
analysed. Thus the 2.90% of observations that are Domestic and have repair record 1 is
2/69 or 2.90% of the observations that have non-missing values on foreign and rep78. So
the percents have as base or denominator the total frequency of all observations reported
in the last table.
In turn you can override the default, and there are two ways to do that.
First: groups can be issued under the aegis of by:, which implies output as a series of
separate tables, and also that percents are calculated separately for each table.
Using by: gives a clear separation of subtables. For some tastes or purposes, the
downside is taking up more space. That leads to the next way to get separate calculation
of percents.
Second: use the percent() option.
The percent() option lets you specify the variable or variables that determine groups for
calculation of percents.
The frequencies and percents shown by default are
1. frequencies and percents for one or more variables in varlist
2. cumulative percents for one variable in varlist.
The surmise is that cumulatives are rather more arbitrary with two or more variables,
being necessarily dependent on the order of variables. That is not the law, however, and
a show() option allows you to have none or one or two or three of those.
.
Indeed, cumulative frequencies are also available on request.
.
Here f stands for frequency and F stands for cumulative frequency. In addition, reverse
cumulatives (# or % > value rather than # or % <= value) are also available. This is not a
complete list, but I should mention show(none), which is one of the most useful choices:
more on that in a moment.
A further option select() lets you select which groups are to be listed, for example by a
condition on the frequencies. select(f == 1) selects those groups that occur precisely
once, in which case there is no need to see a frequency of column of 1s, and the percents
and cumulative percents are possibly of no use or interest:
The select() option can be used in another way. select(5)
says: list just the first five of the groups which would otherwise have
been listed. By default, with just one variable specified, that is just
the five lowest groups of values of the variable. Each group,
naturally, could occur more than once:
You can guess that select(-5) starts at the other end:
So these commands give you pictures of the tails of a distribution. (For single variables,
extremes on SSC offers another way to do something similar.)
You can specify order(high) or order(low). In either case, that option specifies listing
in order of the frequencies, not the values of the variables in each group. In the first
case, select(5) gives you the 5 groups which are most frequent. That gives you a stab at
showing especially common values, otherwise often called modes.
If you specify fillin with two or more variables, cross-combinations with zero
frequencies are shown explicitly. These are in cells that would be shown by 0s by
tabulate or by blanks by table. select()ing zeroes gives you a listing of the cells not
present in your dataset. As such values do not exist in the dataset, we do not need to be
told that their frequencies are all zero, so show(none) can be specified.
groups is just sitting on the shoulders of the giant list, so there are several ways to
tweak appearances. Here is one:
We did get the same appearance earlier, but that was just fortuitous. The default of
list separating every 5 lines happened to give a sensible answer.
groups has further unusual options for table commands. colorder() reorders columns in the
table from what it would have been. Typically this is an option for the second or later
pass at tabulation. Suppose we want to highlight which combinations of categories are
most common.
To impart a little variety, we will look at a different dataset.
We might want the frequencies and percents to be more prominent. Say that existing columns
5 and 6 should be moved to the left. It is sufficient to say colorder(5 6). Specifying 5
6 is equivalent to specifying 5 6 1 2 3 4.
You may have noticed the list option sep(0) being used just now to switch off separator
lines.
Another option that is unusual is saving(), which saves the data being tabulated to a new
dataset. In essence (and minor details aside), they are data, at least temporarily, as
otherwise list would not be able to show them. That should help in tabulation in forms
not supported by groups, for graphics or other analyses, or in export to other software.
There's more, but this post is already long, so I'll just point to the help. A longer version of this
will appear in the Stata Journal.
groups is a basic command for listing group frequencies and percents. An early version was
described briefly in Stata Journal 3: 420--439 (2003)
The main underlying idea is that many tables are easily and helpfully presented as lists.
Specifically, the list command is a convenient engine for producing tables in list form.
That command underlies groups. The idea also deserves a good push for other applications
you may have.
More than that: the summary in the opening sentence understates what groups can do. You
can also suppress any display of frequencies and percents. That being so, groups can be
used for almost any table. You may need to calculate beforehand what you want to tabulate,
but that is generally true. All two-way and higher tabulations are reduced to one-way
listings. The argument is that this constraint on layout is not so much of a loss as at
first may appear.
Some simple examples will give the flavour.
Code:
. sysuse auto, clear
of tabulate foreign will be clear:
Code:
. groups foreign +-------------------------------------+ | foreign Freq. Percent % <= | |-------------------------------------| | Domestic 52 70.27 70.27 | | Foreign 22 29.73 100.00 | +-------------------------------------+ . tabulate foreign Car type | Freq. Percent Cum. ------------+----------------------------------- Domestic | 52 70.27 70.27 Foreign | 22 29.73 100.00 ------------+----------------------------------- Total | 74 100.00
Code:
. groups foreign rep78 +------------------------------------+ | foreign rep78 Freq. Percent | |------------------------------------| | Domestic 1 2 2.90 | | Domestic 2 8 11.59 | | Domestic 3 27 39.13 | | Domestic 4 9 13.04 | | Domestic 5 2 2.90 | |------------------------------------| | Foreign 3 3 4.35 | | Foreign 4 9 13.04 | | Foreign 5 9 13.04 | +------------------------------------+ . tabulate foreign rep78 | Repair Record 1978 Car type | 1 2 3 4 | Total -----------+--------------------------------------------+---------- Domestic | 2 8 27 9 | 48 Foreign | 0 0 3 9 | 21 -----------+--------------------------------------------+---------- Total | 2 8 30 18 | 69 | Repair | Record | 1978 Car type | 5 | Total -----------+-----------+---------- Domestic | 2 | 48 Foreign | 9 | 21 -----------+-----------+---------- Total | 11 | 69
defining cells inside the table, groups gives you a listing in which each item is
row identifier, column identifier, cell variable
The example just given was chosen to make a point. Even with a simple table of 2 rows and
5 columns, tabulate is wrapping output awkwardly. Trivially in this case, you could just
swap variables and have 5 rows and 2 columns. But it is easy to get tables that are
awkward both ways. The major problem is too many columns for comfort. (Too many rows for
comfort is awkward with any command.)
The problem of space is usually compounded with three-way and higher tables. Even if
there is enough space, the sparsity (many zeroes) of some tables often makes other kinds
of tabulation attractive. Broadly, it is with such tables with three or more variables
that groups does especially well. A list structure is clearly general enough to extend to
one or more identifiers, one or more cell variables
By default, groups will not list cross-combinations of two or more identifiers that are
not present in the data. Note how in the last groups output, there are no lines for cars
that are foreign with repair record 1 or 2: as is explicit in the tabulate output, there
are no such cars. Conversely, a fillin option is available:
.
Code:
groups foreign rep78, fillin +------------------------------------+ | foreign rep78 Freq. Percent | |------------------------------------| | Domestic 1 2 2.90 | | Domestic 2 8 11.59 | | Domestic 3 27 39.13 | | Domestic 4 9 13.04 | | Domestic 5 2 2.90 | |------------------------------------| | Foreign 1 0 0.00 | | Foreign 2 0 0.00 | | Foreign 3 3 4.35 | | Foreign 4 9 13.04 | | Foreign 5 9 13.04 | +------------------------------------+
analysed. Thus the 2.90% of observations that are Domestic and have repair record 1 is
2/69 or 2.90% of the observations that have non-missing values on foreign and rep78. So
the percents have as base or denominator the total frequency of all observations reported
in the last table.
In turn you can override the default, and there are two ways to do that.
First: groups can be issued under the aegis of by:, which implies output as a series of
separate tables, and also that percents are calculated separately for each table.
Code:
. bysort foreign: groups rep78 ------------------------------------------------------------------------------ -> foreign = Domestic +----------------------------------+ | rep78 Freq. Percent %<= | |----------------------------------| | 1 2 4.17 4.17 | | 2 8 16.67 20.83 | | 3 27 56.25 77.08 | | 4 9 18.75 95.83 | | 5 2 4.17 100.00 | +----------------------------------+ ------------------------------------------------------------------------------ -> foreign = Foreign +----------------------------------+ | rep78 Freq. Percent %<= | |----------------------------------| | 3 3 14.29 14.29 | | 4 9 42.86 57.14 | | 5 9 42.86 100.00 | +----------------------------------+
downside is taking up more space. That leads to the next way to get separate calculation
of percents.
Second: use the percent() option.
Code:
. groups foreign rep78, percent(foreign) +------------------------------------+ | foreign rep78 Freq. Percent | |------------------------------------| | Domestic 1 2 4.17 | | Domestic 2 8 16.67 | | Domestic 3 27 56.25 | | Domestic 4 9 18.75 | | Domestic 5 2 4.17 | |------------------------------------| | Foreign 3 3 14.29 | | Foreign 4 9 42.86 | | Foreign 5 9 42.86 | +------------------------------------+
calculation of percents.
The frequencies and percents shown by default are
1. frequencies and percents for one or more variables in varlist
2. cumulative percents for one variable in varlist.
The surmise is that cumulatives are rather more arbitrary with two or more variables,
being necessarily dependent on the order of variables. That is not the law, however, and
a show() option allows you to have none or one or two or three of those.
.
Code:
groups foreign rep78, percent(foreign) show(f p P) +---------------------------------------------+ | foreign rep78 Freq. Percent %<= | |---------------------------------------------| | Domestic 1 2 4.17 4.17 | | Domestic 2 8 16.67 20.83 | | Domestic 3 27 56.25 77.08 | | Domestic 4 9 18.75 95.83 | | Domestic 5 2 4.17 100.00 | |---------------------------------------------| | Foreign 3 3 14.29 14.29 | | Foreign 4 9 42.86 57.14 | | Foreign 5 9 42.86 100.00 | +---------------------------------------------+
.
Code:
groups mpg, show(f F) +-------------------+ | mpg Freq. #<= | |-------------------| | 12 2 2 | | 14 6 8 | | 15 2 10 | | 16 4 14 | | 17 4 18 | |-------------------| | 18 9 27 | | 19 8 35 | | 20 3 38 | | 21 5 43 | | 22 5 48 | |-------------------| | 23 3 51 | | 24 4 55 | | 25 5 60 | | 26 3 63 | | 28 3 66 | |-------------------| | 29 1 67 | | 30 2 69 | | 31 1 70 | | 34 1 71 | | 35 2 73 | |-------------------| | 41 1 74 | +-------------------+
cumulatives (# or % > value rather than # or % <= value) are also available. This is not a
complete list, but I should mention show(none), which is one of the most useful choices:
more on that in a moment.
A further option select() lets you select which groups are to be listed, for example by a
condition on the frequencies. select(f == 1) selects those groups that occur precisely
once, in which case there is no need to see a frequency of column of 1s, and the percents
and cumulative percents are possibly of no use or interest:
Code:
. groups mpg, select(f == 1) show(none) +-----+ | mpg | |-----| | 29 | | 31 | | 34 | | 41 | +-----+
says: list just the first five of the groups which would otherwise have
been listed. By default, with just one variable specified, that is just
the five lowest groups of values of the variable. Each group,
naturally, could occur more than once:
Code:
. groups mpg, select(5) +-------------------------------+ | mpg Freq. Percent %<= | |-------------------------------| | 12 2 2.70 2.70 | | 14 6 8.11 10.81 | | 15 2 2.70 13.51 | | 16 4 5.41 18.92 | | 17 4 5.41 24.32 | +-------------------------------+
Code:
. groups mpg, select(-5) +--------------------------------+ | mpg Freq. Percent %<= | |--------------------------------| | 30 2 2.70 93.24 | | 31 1 1.35 94.59 | | 34 1 1.35 95.95 | | 35 2 2.70 98.65 | | 41 1 1.35 100.00 | +--------------------------------+
extremes on SSC offers another way to do something similar.)
You can specify order(high) or order(low). In either case, that option specifies listing
in order of the frequencies, not the values of the variables in each group. In the first
case, select(5) gives you the 5 groups which are most frequent. That gives you a stab at
showing especially common values, otherwise often called modes.
Code:
. groups mpg, select(5) order(h) +-------------------------------+ | mpg Freq. Percent %<= | |-------------------------------| | 18 9 12.16 12.16 | | 19 8 10.81 22.97 | | 14 6 8.11 31.08 | | 21 5 6.76 37.84 | | 22 5 6.76 44.59 | +-------------------------------+
frequencies are shown explicitly. These are in cells that would be shown by 0s by
tabulate or by blanks by table. select()ing zeroes gives you a listing of the cells not
present in your dataset. As such values do not exist in the dataset, we do not need to be
told that their frequencies are all zero, so show(none) can be specified.
Code:
. groups foreign rep78, fillin select(f == 0) show(none) +-----------------+ | foreign rep78 | |-----------------| | Foreign 1 | | Foreign 2 | +-----------------+
tweak appearances. Here is one:
Code:
. groups foreign rep78, sepby(foreign) +------------------------------------+ | foreign rep78 Freq. Percent | |------------------------------------| | Domestic 1 2 2.90 | | Domestic 2 8 11.59 | | Domestic 3 27 39.13 | | Domestic 4 9 13.04 | | Domestic 5 2 2.90 | |------------------------------------| | Foreign 3 3 4.35 | | Foreign 4 9 13.04 | | Foreign 5 9 13.04 | +------------------------------------+
list separating every 5 lines happened to give a sensible answer.
groups has further unusual options for table commands. colorder() reorders columns in the
table from what it would have been. Typically this is an option for the second or later
pass at tabulation. Suppose we want to highlight which combinations of categories are
most common.
To impart a little variety, we will look at a different dataset.
Code:
. webuse nlswork (National Longitudinal Survey. Young Women 14-26 years of age in 1968) . groups collgrad not_smsa c_city south, order(high) sep(0) +--------------------------------------------------------+ | collgrad not_smsa c_city south Freq. Percent | |--------------------------------------------------------| | 0 0 0 0 5742 20.13 | | 0 0 1 0 4941 17.32 | | 0 1 0 1 3982 13.96 | | 0 0 1 1 3455 12.11 | | 0 1 0 0 3086 10.82 | | 0 0 0 1 2527 8.86 | | 1 0 0 0 1412 4.95 | | 1 0 1 0 1096 3.84 | | 1 0 1 1 698 2.45 | | 1 0 0 1 598 2.10 | | 1 1 0 0 566 1.98 | | 1 1 0 1 423 1.48 | +--------------------------------------------------------+
5 and 6 should be moved to the left. It is sufficient to say colorder(5 6). Specifying 5
6 is equivalent to specifying 5 6 1 2 3 4.
Code:
. groups collgrad not_smsa c_city south, order(high) sep(0) colorder(5 6) +--------------------------------------------------------+ | Freq. Percent collgrad not_smsa c_city south | |--------------------------------------------------------| | 5742 20.13 0 0 0 0 | | 4941 17.32 0 0 1 0 | | 3982 13.96 0 1 0 1 | | 3455 12.11 0 0 1 1 | | 3086 10.82 0 1 0 0 | | 2527 8.86 0 0 0 1 | | 1412 4.95 1 0 0 0 | | 1096 3.84 1 0 1 0 | | 698 2.45 1 0 1 1 | | 598 2.10 1 0 0 1 | | 566 1.98 1 1 0 0 | | 423 1.48 1 1 0 1 | +--------------------------------------------------------+
lines.
Another option that is unusual is saving(), which saves the data being tabulated to a new
dataset. In essence (and minor details aside), they are data, at least temporarily, as
otherwise list would not be able to show them. That should help in tabulation in forms
not supported by groups, for graphics or other analyses, or in export to other software.
There's more, but this post is already long, so I'll just point to the help. A longer version of this
will appear in the Stata Journal.
Comment