Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Drop duplicates with highest value and keep if missing

    Hi,

    I have been searching the forum and found some variations of code which I have tried to use, but I can't get it to properly keep the right observations.

    I have a dataset with 3260 observations, out of which 26 are duplicates (aka 3234 distinct IDs). I have a variable, creatinin, which can contain either a value or missing.
    - If the patient has 2 observations and both have missing values; I want to keep the missing value, but only one observation.
    - ff the patient has 1 observation with missing and 1 with a creatinin value; I want to keep the one with the creatinin.
    - If the patient has 2 observations with creatinin values, I want to keep the lowest of the two.

    I have tried:
    Code:
    bysort id_code: egen wanted=min(creatinin)
    gen OK = !missing(creatinin)
    bysort OK id_code (creatinin) : keep if OK & _n == _N
    Which deleted 390 observations.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str16 id_code double creatinin
    "B0020"   1.8
    "B0020"   3.26
    "B0021"   .
    "B0021"   1.88
    "B0045"   .
    "B0045"   .
    "B0077"   .
    "B0077"   2.1
    "B0081"   4.8
    "B0081"   .
    "B2002"   2.2
    "B2042"   1.7
    "B2042"   1.9
    end
    Any help would be greatly appreciated!

  • #2
    Not at my computer, but, use duplicates to create an indicator for the patientid variable, then use egen to denote which duplicate value is the maximum value, then drop the maximum value and you'll have the missing ones. If someone doesn't beat me to it, I'll look when I'm home from the doctor's.

    Comment


    • #3
      Code:
      bys id_code (creatinin): gen byte to_drop = (_N>1 & _n != 1)
      which produces:
      Code:
      . li , noobs sepby(id_code)
      
        +------------------------------+
        | id_code   creati~n   to_drop |
        |------------------------------|
        |   B0020        1.8         0 |
        |   B0020       3.26         1 |
        |------------------------------|
        |   B0021       1.88         0 |
        |   B0021          .         1 |
        |------------------------------|
        |   B0045          .         0 |
        |   B0045          .         1 |
        |------------------------------|
        |   B0077        2.1         0 |
        |   B0077          .         1 |
        |------------------------------|
        |   B0081        4.8         0 |
        |   B0081          .         1 |
        |------------------------------|
        |   B2002        2.2         0 |
        |------------------------------|
        |   B2042        1.7         0 |
        |   B2042        1.9         1 |
        +------------------------------+
      you can then just
      Code:
      drop if to_drop
      Here, I created to_drop just to show you which observations were going to be dropped. You could instead directly do
      Code:
      bys id_code (creatinin): drop if _N>1 & _n != 1
      Last edited by Hemanshu Kumar; 28 Nov 2022, 09:47.

      Comment


      • #4
        A big thanks to both of you for the quick replies, worked out great!

        Comment

        Working...
        X