  • Is there a way in Stata to drop cases that have more than a certain percentage of missing data across variables?

    I am still learning Stata. I have a longitudinal data set with several waves, and in some of the later waves much of the data is missing for various reasons, apparently completely at random. I want to limit the model to cases with, say, less than 20% missing.

  • #2
    There are several ways, depending on how your data are structured. See "help egen" and look for rowmiss() or rownonmiss() if your data are in wide form, or egen's count() if your variable(s) are in long form.
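
    For the wide case, a minimal sketch might look like the following. The variable names y1-y5 and the toy values are hypothetical, not taken from the original poster's data, and the 20% cutoff is the one mentioned in the question. The proportion is stored as a double so that a comparison at exactly 20% is not affected by float rounding.

    ```stata
    * Hypothetical wide layout: one row per id, one column per wave (y1-y5)
    clear
    input id y1 y2 y3 y4 y5
    1 3 . 3 0 .
    2 9 7 8 6 5
    3 . . . . .
    4 9 5 6 5 .
    end

    * Count missing values across the wave variables in each row
    egen nmiss = rowmiss(y1-y5)

    * Number of variables scanned, taken from the varlist rather than hard-coded
    unab wavevars : y1-y5
    local nvars : word count `wavevars'

    * Proportion of the wave variables that are missing for each id
    gen double prop_miss = nmiss / `nvars'

    * Drop ids with more than 20% missing across the waves
    drop if prop_miss > 0.20

    list
    ```

    In this toy example ids 1 and 3 are dropped, while id 4, which is missing exactly one wave in five, is kept. Use >= instead of > if the cutoff should be strictly less than 20% missing.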


    • #3
      If I understand the question correctly, you have a varying number of observations per id, and for any id some of the observations have missing data. If that is the case, you need to do the following:

      1) Count the number of records per id
      2) Count how many of those records contain missing data on the variable of interest
      3) Calculate the percent of missing data per id
      4) Drop all records for any id with more than 20% missing data.

      Here is an example of how to do that. There are more efficient ways, but I have done it as a series of discrete steps so you can see what is going on.


      Code:
      . * show constructed toy data set
      . list, sepby(id)
      
           +--------+
           | id   y |
           |--------|
        1. |  1   3 |
        2. |  1   . |
        3. |  1   3 |
        4. |  1   0 |
        5. |  1   . |
        6. |  1   . |
        7. |  1   . |
           |--------|
        8. |  2   9 |
        9. |  2   7 |
           |--------|
       10. |  3   . |
           |--------|
       11. |  4   9 |
       12. |  4   5 |
       13. |  4   6 |
       14. |  4   5 |
       15. |  4   9 |
           |--------|
       16. |  5   1 |
       17. |  5   2 |
       18. |  5   3 |
       19. |  5   4 |
       20. |  5   5 |
       21. |  5   . |
           |--------|
       22. |  6   . |
       23. |  6   . |
           |--------|
       24. |  7   1 |
       25. |  7   2 |
       26. |  7   3 |
       27. |  7   4 |
           +--------+
      
      . * get case count per id
      . bysort id: gen cases = _N
      
      . * count number of nonmissing (valid) values of y per id
      . by id: egen nvalid = count(y)
      
      . * compute proportion of valid (nonmissing) data per id
      . by id: gen prop_valid = nvalid/cases
      
      . list, sepby(id)
      
           +------------------------------------+
           | id   y   cases   nvalid   prop_v~d |
           |------------------------------------|
        1. |  1   3       7        3   .4285714 |
        2. |  1   .       7        3   .4285714 |
        3. |  1   3       7        3   .4285714 |
        4. |  1   0       7        3   .4285714 |
        5. |  1   .       7        3   .4285714 |
        6. |  1   .       7        3   .4285714 |
        7. |  1   .       7        3   .4285714 |
           |------------------------------------|
        8. |  2   9       2        2          1 |
        9. |  2   7       2        2          1 |
           |------------------------------------|
       10. |  3   .       1        0          0 |
           |------------------------------------|
       11. |  4   9       5        5          1 |
       12. |  4   5       5        5          1 |
       13. |  4   6       5        5          1 |
       14. |  4   5       5        5          1 |
       15. |  4   9       5        5          1 |
           |------------------------------------|
       16. |  5   1       6        5   .8333333 |
       17. |  5   2       6        5   .8333333 |
       18. |  5   3       6        5   .8333333 |
       19. |  5   4       6        5   .8333333 |
       20. |  5   5       6        5   .8333333 |
       21. |  5   .       6        5   .8333333 |
           |------------------------------------|
       22. |  6   .       2        0          0 |
       23. |  6   .       2        0          0 |
           |------------------------------------|
       24. |  7   1       4        4          1 |
       25. |  7   2       4        4          1 |
       26. |  7   3       4        4          1 |
       27. |  7   4       4        4          1 |
           +------------------------------------+
      
      . * drop all records for ids with less than 80% valid data (more than 20% missing)
      . drop if prop_valid < .8
      (10 observations deleted)
      
      . list, sepby(id)
      
           +------------------------------------+
           | id   y   cases   nvalid   prop_v~d |
           |------------------------------------|
        1. |  2   9       2        2          1 |
        2. |  2   7       2        2          1 |
           |------------------------------------|
        3. |  4   9       5        5          1 |
        4. |  4   5       5        5          1 |
        5. |  4   6       5        5          1 |
        6. |  4   5       5        5          1 |
        7. |  4   9       5        5          1 |
           |------------------------------------|
        8. |  5   1       6        5   .8333333 |
        9. |  5   2       6        5   .8333333 |
       10. |  5   3       6        5   .8333333 |
       11. |  5   4       6        5   .8333333 |
       12. |  5   5       6        5   .8333333 |
       13. |  5   .       6        5   .8333333 |
           |------------------------------------|
       14. |  7   1       4        4          1 |
       15. |  7   2       4        4          1 |
       16. |  7   3       4        4          1 |
       17. |  7   4       4        4          1 |
           +------------------------------------+
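
      As one possible more compact version of the steps above, the sketch below rebuilds the toy data with input and computes the proportion missing per id in a single egen call. The variable name prop_miss is my own choice for illustration. It relies on missing(y) evaluating to 1 for a missing value and 0 otherwise, so its mean within id is the proportion missing.

      ```stata
      * Rebuild the toy data from the listing above ("." is a missing value)
      clear
      input id y
      1 3
      1 .
      1 3
      1 0
      1 .
      1 .
      1 .
      2 9
      2 7
      3 .
      4 9
      4 5
      4 6
      4 5
      4 9
      5 1
      5 2
      5 3
      5 4
      5 5
      5 .
      6 .
      6 .
      7 1
      7 2
      7 3
      7 4
      end

      * Proportion of records with missing y within each id
      bysort id: egen double prop_miss = mean(missing(y))

      * Drop every record for ids with more than 20% missing
      drop if prop_miss > 0.20

      list, sepby(id)
      ```

      This should drop the same 10 observations (ids 1, 3, and 6) as the step-by-step version and leave 17. Note that id 5 keeps its one missing record, because the rule removes whole ids rather than individual missing observations.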


      Richard T. Campbell
      Emeritus Professor of Biostatistics and Sociology
      University of Illinois at Chicago
