Search within groups and dropping unwanted records

Nathan Garst

Join Date: Dec 2021

Posts: 15
#1

Search within groups and dropping unwanted records

30 Mar 2022, 19:38

Hi everyone,

I'm working with vaccination data and I have loads of duplicates, and need to get rid of them. I want to give priority to the duplicate records where people are 'fully vaccinated', and drop all other obs within that group that are either 'partially vaccinated' or missing. Similarly, if a duplicate group has only 'partially vaccinated' duplicates and missing, I want to only keep one of the partials.

As a caveat, I'd like to preserve if surveysent == True and surveyresp == True for the duplicate obs (if there are any) after the above vaccination criteria are met.

This is complex and can't quite wrap my tiny brain around it. Any thoughts?

Code:

clear input long(vaxstat surveysent surveyresp) float group 1 2 1 25 1 1 1 25 . 2 1 25 . 1 1 25 2 2 1 24 2 1 1 24 1 2 1 24 1 1 1 24 2 2 2 19 2 1 1 19 . 2 2 19 . 1 1 19 end label values vaxstat vaxstat label def vaxstat 1 "1. Fully Vaccinated", modify label def vaxstat 2 "2. Partially Vaccinated", modify label values surveysent surveysent1 label def surveysent1 1 "1. FALSE", modify label def surveysent1 2 "2. TRUE", modify label values surveyresp surveyresp label def surveyresp 1 "1. FALSE", modify label def surveyresp 2 "2. TRUE", modify
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#2

30 Mar 2022, 19:53

Code:

gsort group vaxstat -surveysent -surveyresp by group: keep if _n == 1

You have a hierarchical preference ordering criterion for the observations within groups. Your top-level criterion is the fullest level of vaccination, which corresponds to the lowest numerical value of the variable vaxstat. Your secondary criterion is based on surveysent and survey resp, and in this case the preference is for the highest numerical value of the variables surveyresp and surveysent. So the logic is to just sort the data into order of descending preference--which corresponds to sorting within groups on vaxstat (in ascending order) and surveysent and survey resp (in descending order). Then you just keep the first observation in each group.

Note: Using 1/2 coding for a boolean variable with 1 = false and 2 = true is going to make your life unnecessarily complicated in Stata. The way Stata naturally handles boolean expressions, 0 is false and anything other than 0 is true. So this makes 1 = true / 0 = false the most natural coding. I recommend you recode those variables. The code shown above would not change because it would still be the case that true corresponds to the numerically higher value if you do that.
Comment
Nathan Garst

Join Date: Dec 2021

Posts: 15
#3

04 Apr 2022, 10:27

You have a hierarchical preference ordering criterion for the observations within groups.

Thank you Clyde, this is a much simpler solution than I imagined and very helpful to know as I deal with complicated de-duplication a lot.

Note: Using 1/2 coding for a boolean variable with 1 = false and 2 = true is going to make your life unnecessarily complicated in Stata.

Thanks for this tip as well. I'll get in the habit of modifying the labels after running encode on these true/false variables.
Comment

Announcement

Search within groups and dropping unwanted records

Comment

Comment