Is there a way in Stata to drop cases if they have greater than a certain percent missing in data across variables?

Wyatt Brown

Join Date: May 2015

Posts: 1
#1

12 May 2015, 15:30

I'm still learning Stata. I have a longitudinal data set with several waves, and in some of the later waves much of the data is missing for various reasons, apparently completely at random. I want to limit the model to cases with, say, less than 20% missing.
ben earnhart

Join Date: May 2014

Posts: 1027
#2

12 May 2015, 17:34

Several ways, depending on how your data is structured. See "help egen" and look at rowmiss() and rownonmiss() if your data is wide, or egen's count() function (used with by:) if your data is long.
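
For the wide case, a minimal sketch of the rowmiss() route might look like the following. The variable names y1-y5 and the toy values are made up for illustration, and the 20% cutoff is taken from the original question. The proportion is stored as a double so that the comparison with 0.2 is not thrown off by float rounding.

```stata
* Hypothetical wide data: one row per case, one column per wave
clear
input id y1 y2 y3 y4 y5
1 3 . 3 . .
2 9 7 8 6 5
3 1 2 3 4 .
end

* Expand the variable list and count how many variables it contains
unab vars : y1-y5
local k : word count `vars'

* Number of missing values across those variables in each row
egen nmiss = rowmiss(`vars')

* Proportion missing per case, stored as double to avoid float rounding
gen double prop_miss = nmiss/`k'

* Drop cases with more than 20% of the variables missing
drop if prop_miss > 0.2
```

Replace y1-y5 with your own variable list; the cutoff in the last line is the only other thing that should need changing.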

Dick Campbell

Join Date: Apr 2014
Posts: 279

13 May 2015, 08:57

If I understand this question, you have varying numbers of observations per id, and for any id some of the observations have missing data. If that is the case, you need to do the following:

1) Count the number of records per id
2) Count how many of those records contain missing data on the variable of interest
3) Calculate the percent of missing data per id
4) Drop all records for any id with more than 20% missing data.

Here is an example of how to do that. There are more efficient ways but I have done it as a series of discrete steps so you can see what is going on.

Code:

. * show constructed toy data set
. list, sepby(id)

     +--------+
     | id   y |
     |--------|
  1. |  1   3 |
  2. |  1   . |
  3. |  1   3 |
  4. |  1   0 |
  5. |  1   . |
  6. |  1   . |
  7. |  1   . |
     |--------|
  8. |  2   9 |
  9. |  2   7 |
     |--------|
 10. |  3   . |
     |--------|
 11. |  4   9 |
 12. |  4   5 |
 13. |  4   6 |
 14. |  4   5 |
 15. |  4   9 |
     |--------|
 16. |  5   1 |
 17. |  5   2 |
 18. |  5   3 |
 19. |  5   4 |
 20. |  5   5 |
 21. |  5   . |
     |--------|
 22. |  6   . |
 23. |  6   . |
     |--------|
 24. |  7   1 |
 25. |  7   2 |
 26. |  7   3 |
 27. |  7   4 |
     +--------+

. * get case count per id
. bysort id: gen cases = _N

. * count number of nonmissing (valid) records per id
. by id: egen nvalid = count(y)

. * compute proportion of valid (nonmissing) data per id
. by id: gen prop_valid = nvalid/cases

. list, sepby(id)

     +------------------------------------+
     | id   y   cases   nvalid   prop_v~d |
     |------------------------------------|
  1. |  1   3       7        3   .4285714 |
  2. |  1   .       7        3   .4285714 |
  3. |  1   3       7        3   .4285714 |
  4. |  1   0       7        3   .4285714 |
  5. |  1   .       7        3   .4285714 |
  6. |  1   .       7        3   .4285714 |
  7. |  1   .       7        3   .4285714 |
     |------------------------------------|
  8. |  2   9       2        2          1 |
  9. |  2   7       2        2          1 |
     |------------------------------------|
 10. |  3   .       1        0          0 |
     |------------------------------------|
 11. |  4   9       5        5          1 |
 12. |  4   5       5        5          1 |
 13. |  4   6       5        5          1 |
 14. |  4   5       5        5          1 |
 15. |  4   9       5        5          1 |
     |------------------------------------|
 16. |  5   1       6        5   .8333333 |
 17. |  5   2       6        5   .8333333 |
 18. |  5   3       6        5   .8333333 |
 19. |  5   4       6        5   .8333333 |
 20. |  5   5       6        5   .8333333 |
 21. |  5   .       6        5   .8333333 |
     |------------------------------------|
 22. |  6   .       2        0          0 |
 23. |  6   .       2        0          0 |
     |------------------------------------|
 24. |  7   1       4        4          1 |
 25. |  7   2       4        4          1 |
 26. |  7   3       4        4          1 |
 27. |  7   4       4        4          1 |
     +------------------------------------+

. drop if prop_valid < .8
(10 observations deleted)

. list, sepby(id)

     +------------------------------------+
     | id   y   cases   nvalid   prop_v~d |
     |------------------------------------|
  1. |  2   9       2        2          1 |
  2. |  2   7       2        2          1 |
     |------------------------------------|
  3. |  4   9       5        5          1 |
  4. |  4   5       5        5          1 |
  5. |  4   6       5        5          1 |
  6. |  4   5       5        5          1 |
  7. |  4   9       5        5          1 |
     |------------------------------------|
  8. |  5   1       6        5   .8333333 |
  9. |  5   2       6        5   .8333333 |
 10. |  5   3       6        5   .8333333 |
 11. |  5   4       6        5   .8333333 |
 12. |  5   5       6        5   .8333333 |
 13. |  5   .       6        5   .8333333 |
     |------------------------------------|
 14. |  7   1       4        4          1 |
 15. |  7   2       4        4          1 |
 16. |  7   3       4        4          1 |
 17. |  7   4       4        4          1 |
     +------------------------------------+
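
One of the more compact routes mentioned above might look like this. It is only a sketch, assuming long data with the same variable names as the toy example (id and y); prop_miss is a name introduced here for illustration. Because missing(y) is 1 for a missing value and 0 otherwise, its mean within id is the proportion of missing records for that id.

```stata
* Proportion of records with missing y, computed within each id
* (stored as double so the comparison with 0.2 is not affected by float rounding)
bysort id: egen double prop_miss = mean(missing(y))

* Drop every record for any id with more than 20% missing
drop if prop_miss > 0.2
```

On the toy data this should remove the same ids as the step-by-step version (1, 3, and 6), since `prop_miss > 0.2` is the same condition as `prop_valid < .8`.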

Richard T. Campbell
Emeritus Professor of Biostatistics and Sociology
University of Illinois at Chicago
