In a large dataset, How to Identity variables with more than 90% missing cases?

ishak celik

Join Date: Nov 2021

Posts: 11
#1

In a large dataset, How to Identity variables with more than 90% missing cases?

22 Feb 2022, 05:54

IIn a large dataset with 1000 variables and 50000 cases, how can we quickly identify the variables with highest percentages with missing values, is there a formula for this? thank you...

Last edited by ishak celik; 22 Feb 2022, 06:06.
Tags: None

Nick Cox

Join Date: Mar 2014
Posts: 35211

22 Feb 2022, 06:03

One possibility is missings from the Stata Journal. In this example, I focus on >20% missing but the principle's the same.

Code:

. webuse nlswork
(National Longitudinal Survey of Young Women, 14-24 years old in 1968)

. missings report

Checking missings in all variables:
15082 observations with missing values

-------------------
          |      #
----------+--------
      age |     24
      msp |     16
  nev_mar |     16
    grade |      2
 not_smsa |      8
   c_city |      8
    south |      8
 ind_code |    341
 occ_code |    121
    union |   9296
   wks_ue |   5704
   tenure |    433
    hours |     67
 wks_work |    703
-------------------

. count
  28,534

. di 0.2 * r(N)
5706.8

. missings report, sort min(5707)

Checking missings in all variables:
15082 observations with missing values

----------------
       |      #
-------+--------
 union |   9296
----------------

missings is a frequent keyword in searchers. I didn't want to use a long name for the command, so the downside is that the code is a little hard to find, except that I compensate here by giving an otherwise unpredictable identifier:

Code:

. search dm0085, entry

Search of official help files, FAQs, Examples, and Stata Journals

SJ-20-4 dm0085_2  . . . . . . . . . . . . . . . . Software update for missings
        (help missings if installed)  . . . . . . . . . . . . . . .  N. J. Cox
        Q4/20   SJ 20(4):1028--1030
        sorting has been extended for missings report

SJ-17-3 dm0085_1  . . . . . . . . . . . . . . . . Software update for missings
        (help missings if installed)  . . . . . . . . . . . . . . .  N. J. Cox
        Q3/17   SJ 17(3):779
        identify() and sort options have been added

SJ-15-4 dm0085  Speaking Stata: A set of utilities for managing missing values
        (help missings if installed)  . . . . . . . . . . . . . . .  N. J. Cox
        Q4/15   SJ 15(4):1174--1185
        provides command, missings, as a replacement for, and extension
        of, previous commands nmissing and dropmiss

The 2015 paper remains pertinent, but the Software Update gives the latest code.

Comment

Øyvind Snilsberg

Join Date: Oct 2021
Posts: 591

22 Feb 2022, 06:11

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(v1 v2 v3 v4)
1 1 1 1
. 1 1 1
. . 1 1
. . 1 1
. . 1 1
. . 1 1
. . . 1
. . . 1
. . . 1
. . . 1
. . . .
end

ds
foreach v in `r(varlist)' {
count if missing(`v')
if r(N) > _N*0.9 {
drop `v'
}
}

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35211
#4

22 Feb 2022, 06:19

The code in #3 drops such variables. but it doesn't really provide a good record of a ranking on % of missing values.
Comment
ishak celik

Join Date: Nov 2021

Posts: 11
#5

23 Feb 2022, 03:17

thank you!
Comment

Announcement