Thanks to Kit Baum as usual, a new program missings is now available from SSC.
Use
to install.
Stata version 9 up is required (but see below for a note for any people on
version 9 itself).
missings combines various basic utilities for managing variables that
(may) have missing values. The syntax starts
missings subcommand ...
with a choice of six subcommands. By default "missing" means numeric
missing (that is, sysmiss . or one of the extended missing values .a to
.z) for numeric variables and empty or "" for string variables.
If a varlist is not specified, then it is interpreted by default as all variables.
missings report issues a report on the number of missing values in
varlist. By default counts of missings are given by variables;
optionally counts are given by observations.
missings list lists observations with missing values in varlist.
missings table tabulates observations by the number of missing values in
varlist.
missings tag generates a variable containing the number of missing
values in each observation on varlist.
missings dropvars drops any variables in varlist which are missing on
all values.
missings dropobs drops any observations which are missing on all values
in varlist.
missings is intended to unite and supersede the main ideas of nmissing
(Cox 1999, 2001a, 2003, 2005) and dropmiss (Cox 2001b, 2008), together
with a more up-to-date take on what may be useful given other commands.
Now creating entirely empty observations (rows) and/or variables
(columns) is a habit of many spreadsheet users, but neither is helpful
in Stata datasets. The subcommands dropobs and dropvars should help
users clean up.
Conversely, there is no explicit support in missings for dropping
observations or variables with some missing and some non-missing values.
Users so minded will find other subcommands of use as an intermediate
step, but multiple imputation might be a better way forward.
Let's look at a few examples. We can read in a fairly complicated
dataset and get a concise report on missings. We can specify a threshold
to see which variables look particularly poor.
We can home in on the observations with missing values. Again, a
threshold can be specified:
Another basic subcommand reports counts of missings across all variables
specified, or alternatively all variables:
Where appropriate, a by: prefix can be specified to get a breakdown:
It's sometimes helpful to have a count of missings in a variable. Many
people will know that there is already an egen function for this, but oh
well.
This dataset does not have any variables or observations that are
entirely missing, but let's see how missings could be used to remove
such.
There are various safety features here. Most obviously, if the dataset
in memory has not been saved, then you must show awareness that you are
exerting brute force. In addition, extended missing values .a to .z are
supposed to be informative, so you can specify sysmiss to spell out that
only system missings can be dropped.
Cox, N.J. 1999. Numbers of missing and present values. Stata Technical
Bulletin 49: 7-8.
2001a. Enhancements to numbers of missing and present values. Stata Technical
Bulletin 60: 2-3.
2001b. Dropping variables or observations with missing values. Stata Technical
Bulletin 60: 7-8.
2003. Numbers of present and missing values. Stata Journal 3: 449.
2005. Numbers of present and missing values. Stata Journal 5: 607.
2008. Dropping variables or observations with missing values. Stata Journal 8: 594.
(So what about users of version 9? The code works fine in those
versions. The problem is just that you need to rename the help file from
.sthlp to .hlp.)
Use
Code:
ssc inst missings
Stata version 9 up is required (but see below for a note for any people on
version 9 itself).
missings combines various basic utilities for managing variables that
(may) have missing values. The syntax starts
missings subcommand ...
with a choice of six subcommands. By default "missing" means numeric
missing (that is, sysmiss . or one of the extended missing values .a to
.z) for numeric variables and empty or "" for string variables.
If a varlist is not specified, then it is interpreted by default as all variables.
missings report issues a report on the number of missing values in
varlist. By default counts of missings are given by variables;
optionally counts are given by observations.
missings list lists observations with missing values in varlist.
missings table tabulates observations by the number of missing values in
varlist.
missings tag generates a variable containing the number of missing
values in each observation on varlist.
missings dropvars drops any variables in varlist which are missing on
all values.
missings dropobs drops any observations which are missing on all values
in varlist.
missings is intended to unite and supersede the main ideas of nmissing
(Cox 1999, 2001a, 2003, 2005) and dropmiss (Cox 2001b, 2008), together
with a more up-to-date take on what may be useful given other commands.
Now creating entirely empty observations (rows) and/or variables
(columns) is a habit of many spreadsheet users, but neither is helpful
in Stata datasets. The subcommands dropobs and dropvars should help
users clean up.
Conversely, there is no explicit support in missings for dropping
observations or variables with some missing and some non-missing values.
Users so minded will find other subcommands of use as an intermediate
step, but multiple imputation might be a better way forward.
Let's look at a few examples. We can read in a fairly complicated
dataset and get a concise report on missings. We can specify a threshold
to see which variables look particularly poor.
Code:
. webuse nlswork, clear (National Longitudinal Survey. Young Women 14-26 years of age in 1968) . missings report Checking missings in all variables: 15082 observations with missing values age 24 msp 16 nev_mar 16 grade 2 not_smsa 8 c_city 8 south 8 ind_code 341 occ_code 121 union 9296 wks_ue 5704 tenure 433 hours 67 wks_work 703 . missings report, min(1000) Checking missings in all variables: 15082 observations with missing values union 9296 wks_ue 5704
threshold can be specified:
Code:
. missings list, min(5) Checking missings in all variables: 15082 observations with missing values +------------------------------------------------------------+ 6924. | age | msp | nev_mar | not_smsa | c_city | south | ind_code | | 26 | . | . | 0 | 1 | 0 | 11 | |------------------------------------------------------------| | occ_code | union | wks_ue | tenure | hours | wks_work | | 8 | . | . | .0833333 | 40 | . | +------------------------------------------------------------+ +------------------------------------------------------------+ 19002. | age | msp | nev_mar | not_smsa | c_city | south | ind_code | | 31 | . | . | 0 | 1 | 1 | 7 | |------------------------------------------------------------| | occ_code | union | wks_ue | tenure | hours | wks_work | | 8 | . | . | .3333333 | . | . | +------------------------------------------------------------+ +------------------------------------------------------------+ 21220. | age | msp | nev_mar | not_smsa | c_city | south | ind_code | | 26 | 1 | 0 | . | . | . | 7 | |------------------------------------------------------------| | occ_code | union | wks_ue | tenure | hours | wks_work | | 3 | . | . | .0833333 | 40 | . | +------------------------------------------------------------+ +------------------------------------------------------------+ 22493. | age | msp | nev_mar | not_smsa | c_city | south | ind_code | | 32 | 1 | 0 | 1 | 0 | 0 | . | |------------------------------------------------------------| | occ_code | union | wks_ue | tenure | hours | wks_work | | . | . | . | .75 | . | . | +------------------------------------------------------------+ +------------------------------------------------------------+ 22628. | age | msp | nev_mar | not_smsa | c_city | south | ind_code | | 39 | 0 | 0 | 0 | 1 | 0 | . | |------------------------------------------------------------| | occ_code | union | wks_ue | tenure | hours | wks_work | | 3 | . | . | 1.5 | . | . | +------------------------------------------------------------+
specified, or alternatively all variables:
Code:
. missings table Checking missings in all variables: 15082 observations with missing values # of | missing | values | Freq. Percent Cum. ------------+----------------------------------- 0 | 13,452 47.14 47.14 1 | 13,790 48.33 95.47 2 | 964 3.38 98.85 3 | 291 1.02 99.87 4 | 32 0.11 99.98 5 | 2 0.01 99.99 6 | 3 0.01 100.00 ------------+----------------------------------- Total | 28,534 100.00
Code:
. bysort race: missings table ------------------------------------------------------------------------------- -> race = white Checking missings in all variables: 10576 observations with missing values # of | missing | values | Freq. Percent Cum. ------------+----------------------------------- 0 | 9,604 47.59 47.59 1 | 9,672 47.93 95.52 2 | 677 3.35 98.88 3 | 199 0.99 99.86 4 | 25 0.12 99.99 5 | 1 0.00 99.99 6 | 2 0.01 100.00 ------------+----------------------------------- Total | 20,180 100.00 ------------------------------------------------------------------------------- -> race = black Checking missings in all variables: 4342 observations with missing values # of | missing | values | Freq. Percent Cum. ------------+----------------------------------- 0 | 3,709 46.07 46.07 1 | 3,966 49.26 95.33 2 | 278 3.45 98.78 3 | 89 1.11 99.89 4 | 7 0.09 99.98 5 | 1 0.01 99.99 6 | 1 0.01 100.00 ------------+----------------------------------- Total | 8,051 100.00 ------------------------------------------------------------------------------- -> race = other Checking missings in all variables: 164 observations with missing values # of | missing | values | Freq. Percent Cum. ------------+----------------------------------- 0 | 139 45.87 45.87 1 | 152 50.17 96.04 2 | 9 2.97 99.01 3 | 3 0.99 100.00 ------------+----------------------------------- Total | 303 100.00
people will know that there is already an egen function for this, but oh
well.
Code:
. missings tag, generate(nmissing) Checking missings in all variables: 15082 observations with missing values
entirely missing, but let's see how missings could be used to remove
such.
Code:
. gen newt = "" (28,534 missing values generated) . gen frog = . (28,534 missing values generated) . gen toad = .a (28,534 missing values generated) . missings dropvars newt frog toad, force sysmiss Checking missings in newt frog toad: 28534 observations with system missing values note: newt frog dropped . missings dropvars toad, force sysmiss Checking missings in toad: 0 observations with system missing values note: no variables qualify . missings dropvars toad, force Checking missings in toad: 28534 observations with missing values note: toad dropped . set obs 30000 number of observations (_N) was 28,534, now 30,000 . missings dropobs, force Checking missings in idcode year birth_yr age race msp nev_mar grade collgrad not_smsa c_city south ind_code occ_code union wks_ue ttl_exp tenure hours wks_work ln_wage nmissing: 16548 observations with missing values (1,466 observations deleted)
in memory has not been saved, then you must show awareness that you are
exerting brute force. In addition, extended missing values .a to .z are
supposed to be informative, so you can specify sysmiss to spell out that
only system missings can be dropped.
Cox, N.J. 1999. Numbers of missing and present values. Stata Technical
Bulletin 49: 7-8.
2001a. Enhancements to numbers of missing and present values. Stata Technical
Bulletin 60: 2-3.
2001b. Dropping variables or observations with missing values. Stata Technical
Bulletin 60: 7-8.
2003. Numbers of present and missing values. Stata Journal 3: 449.
2005. Numbers of present and missing values. Stata Journal 5: 607.
2008. Dropping variables or observations with missing values. Stata Journal 8: 594.
(So what about users of version 9? The code works fine in those
versions. The problem is just that you need to rename the help file from
.sthlp to .hlp.)
Comment