Drop duplicates with highest value and keep if missing

Vilma Antonov

Join Date: Aug 2022

Posts: 47
#1

Drop duplicates with highest value and keep if missing

28 Nov 2022, 08:19

Hi,

I have been searching the forum and found some variations of code which I have tried to use, but I can't get it to properly keep the right observations.

I have a dataset with 3260 observations, out of which 26 are duplicates (aka 3234 distinct IDs). I have a variable, creatinin, which can contain either a value or missing.
- If the patient has 2 observations and both have missing values; I want to keep the missing value, but only one observation.
- ff the patient has 1 observation with missing and 1 with a creatinin value; I want to keep the one with the creatinin.
- If the patient has 2 observations with creatinin values, I want to keep the lowest of the two.

I have tried:

Code:

bysort id_code: egen wanted=min(creatinin) gen OK = !missing(creatinin) bysort OK id_code (creatinin) : keep if OK & _n == _N

Which deleted 390 observations.

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str16 id_code double creatinin "B0020" 1.8 "B0020" 3.26 "B0021" . "B0021" 1.88 "B0045" . "B0045" . "B0077" . "B0077" 2.1 "B0081" 4.8 "B0081" . "B2002" 2.2 "B2042" 1.7 "B2042" 1.9 end

Any help would be greatly appreciated!
Tags: None
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#2

28 Nov 2022, 08:32

Not at my computer, but, use duplicates to create an indicator for the patientid variable, then use egen to denote which duplicate value is the maximum value, then drop the maximum value and you'll have the missing ones. If someone doesn't beat me to it, I'll look when I'm home from the doctor's.
Comment

Hemanshu Kumar

Join Date: Mar 2015
Posts: 1396

28 Nov 2022, 08:45

Code:

bys id_code (creatinin): gen byte to_drop = (_N>1 & _n != 1)

which produces:

Code:

. li , noobs sepby(id_code)

  +------------------------------+
  | id_code   creati~n   to_drop |
  |------------------------------|
  |   B0020        1.8         0 |
  |   B0020       3.26         1 |
  |------------------------------|
  |   B0021       1.88         0 |
  |   B0021          .         1 |
  |------------------------------|
  |   B0045          .         0 |
  |   B0045          .         1 |
  |------------------------------|
  |   B0077        2.1         0 |
  |   B0077          .         1 |
  |------------------------------|
  |   B0081        4.8         0 |
  |   B0081          .         1 |
  |------------------------------|
  |   B2002        2.2         0 |
  |------------------------------|
  |   B2042        1.7         0 |
  |   B2042        1.9         1 |
  +------------------------------+

you can then just

Code:

drop if to_drop

Here, I created to_drop just to show you which observations were going to be dropped. You could instead directly do

Code:

bys id_code (creatinin): drop if _N>1 & _n != 1

Last edited by Hemanshu Kumar; 28 Nov 2022, 08:47.

Comment

Vilma Antonov

Join Date: Aug 2022

Posts: 47
#4

28 Nov 2022, 09:04

A big thanks to both of you for the quick replies, worked out great!
Comment

Announcement

Drop duplicates with highest value and keep if missing

Comment

Comment

Comment