Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • In a large dataset, How to Identity variables with more than 90% missing cases?

    IIn a large dataset with 1000 variables and 50000 cases, how can we quickly identify the variables with highest percentages with missing values, is there a formula for this? thank you...
    Last edited by ishak celik; 22 Feb 2022, 06:06.

  • #2
    One possibility is missings from the Stata Journal. In this example, I focus on >20% missing but the principle's the same.

    Code:
    . webuse nlswork
    (National Longitudinal Survey of Young Women, 14-24 years old in 1968)
    
    . missings report
    
    Checking missings in all variables:
    15082 observations with missing values
    
    -------------------
              |      #
    ----------+--------
          age |     24
          msp |     16
      nev_mar |     16
        grade |      2
     not_smsa |      8
       c_city |      8
        south |      8
     ind_code |    341
     occ_code |    121
        union |   9296
       wks_ue |   5704
       tenure |    433
        hours |     67
     wks_work |    703
    -------------------
    
    . count
      28,534
    
    . di 0.2 * r(N)
    5706.8
    
    . missings report, sort min(5707)
    
    Checking missings in all variables:
    15082 observations with missing values
    
    ----------------
           |      #
    -------+--------
     union |   9296
    ----------------
    missings is a frequent keyword in searchers. I didn't want to use a long name for the command, so the downside is that the code is a little hard to find, except that I compensate here by giving an otherwise unpredictable identifier:


    Code:
    . search dm0085, entry
    
    Search of official help files, FAQs, Examples, and Stata Journals
    
    SJ-20-4 dm0085_2  . . . . . . . . . . . . . . . . Software update for missings
            (help missings if installed)  . . . . . . . . . . . . . . .  N. J. Cox
            Q4/20   SJ 20(4):1028--1030
            sorting has been extended for missings report
    
    SJ-17-3 dm0085_1  . . . . . . . . . . . . . . . . Software update for missings
            (help missings if installed)  . . . . . . . . . . . . . . .  N. J. Cox
            Q3/17   SJ 17(3):779
            identify() and sort options have been added
    
    SJ-15-4 dm0085  Speaking Stata: A set of utilities for managing missing values
            (help missings if installed)  . . . . . . . . . . . . . . .  N. J. Cox
            Q4/15   SJ 15(4):1174--1185
            provides command, missings, as a replacement for, and extension
            of, previous commands nmissing and dropmiss
    The 2015 paper remains pertinent, but the Software Update gives the latest code.

    Comment


    • #3
      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input float(v1 v2 v3 v4)
      1 1 1 1
      . 1 1 1
      . . 1 1
      . . 1 1
      . . 1 1
      . . 1 1
      . . . 1
      . . . 1
      . . . 1
      . . . 1
      . . . .
      end
      
      ds
      foreach v in `r(varlist)' {
      count if missing(`v')
      if r(N) > _N*0.9 {
      drop `v'
      }
      }

      Comment


      • #4
        The code in #3 drops such variables. but it doesn't really provide a good record of a ranking on % of missing values.

        Comment


        • #5
          thank you!

          Comment

          Working...
          X