Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Search within groups and dropping unwanted records

    Hi everyone,

    I'm working with vaccination data and I have loads of duplicates, and need to get rid of them. I want to give priority to the duplicate records where people are 'fully vaccinated', and drop all other obs within that group that are either 'partially vaccinated' or missing. Similarly, if a duplicate group has only 'partially vaccinated' duplicates and missing, I want to only keep one of the partials.

    As a caveat, I'd like to preserve if surveysent == True and surveyresp == True for the duplicate obs (if there are any) after the above vaccination criteria are met.

    This is complex and can't quite wrap my tiny brain around it. Any thoughts?

    Code:
    clear
    input long(vaxstat surveysent surveyresp) float group
    1 2 1 25
    1 1 1 25
    . 2 1 25
    . 1 1 25
    2 2 1 24
    2 1 1 24
    1 2 1 24
    1 1 1 24
    2 2 2 19
    2 1 1 19
    . 2 2 19
    . 1 1 19
    end
    label values vaxstat vaxstat
    label def vaxstat 1 "1. Fully Vaccinated", modify
    label def vaxstat 2 "2. Partially Vaccinated", modify
    label values surveysent surveysent1
    label def surveysent1 1 "1. FALSE", modify
    label def surveysent1 2 "2. TRUE", modify
    label values surveyresp surveyresp
    label def surveyresp 1 "1. FALSE", modify
    label def surveyresp 2 "2. TRUE", modify

  • #2
    Code:
    gsort group vaxstat -surveysent -surveyresp
    by group: keep if _n == 1
    You have a hierarchical preference ordering criterion for the observations within groups. Your top-level criterion is the fullest level of vaccination, which corresponds to the lowest numerical value of the variable vaxstat. Your secondary criterion is based on surveysent and survey resp, and in this case the preference is for the highest numerical value of the variables surveyresp and surveysent. So the logic is to just sort the data into order of descending preference--which corresponds to sorting within groups on vaxstat (in ascending order) and surveysent and survey resp (in descending order). Then you just keep the first observation in each group.

    Note: Using 1/2 coding for a boolean variable with 1 = false and 2 = true is going to make your life unnecessarily complicated in Stata. The way Stata naturally handles boolean expressions, 0 is false and anything other than 0 is true. So this makes 1 = true / 0 = false the most natural coding. I recommend you recode those variables. The code shown above would not change because it would still be the case that true corresponds to the numerically higher value if you do that.

    Comment


    • #3
      You have a hierarchical preference ordering criterion for the observations within groups.
      Thank you Clyde, this is a much simpler solution than I imagined and very helpful to know as I deal with complicated de-duplication a lot.

      Note: Using 1/2 coding for a boolean variable with 1 = false and 2 = true is going to make your life unnecessarily complicated in Stata.
      Thanks for this tip as well. I'll get in the habit of modifying the labels after running encode on these true/false variables.

      Comment

      Working...
      X