How to choose which duplicates to drop/keep?

Laura Holm

Join Date: Apr 2019

Posts: 2
#1


30 Apr 2019, 00:39

Hi there,

I'm working on data from 2012-2018 concerning HIV-positive pregnant women in treatment and loss to follow-up (LTFU). For those who have been in treatment more than once because they had more children, their ID occurs two, three or four times. The information attached to the duplicate IDs includes, among other things, the start and end dates of treatment (some end dates are missing). I need to keep the "latest" IDs and date variables for those that occur more than once in order to trace their latest contact with the clinic (among those where the end date is missing) and register whether or not they are LTFU. Each woman is only supposed to occur once during the study period. If I run a simple "duplicates drop idp, force", I will have to use the start date of their first treatment and their latest contact with the clinic from later visits to calculate their follow-up time, which will then be too long.

Any thoughts? (I'm using Stata 14 on a Mac)

Thank you!

Best regards, Laura
Nick Cox

Join Date: Mar 2014

Posts: 35696
#2

30 Apr 2019, 01:37

This does not sound like a case for duplicates at all. Perhaps you should work with maximum and minimum dates. For example, commands of the form

Code:

egen max_y = max(y), by(idp)
egen min_y = min(y), by(idp)

will calculate first and last dates. If you wish you can then do analyses conditional on an observation being the first date, or the last date.

Code:

.... if y == max_y

Most drastic of all would be to reduce the dataset to one observation per patient.

Code:

keep if y == max_y

Here, naturally, y is generic: other than idp, you don't give any variable names.
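
For the follow-up calculation described in the question, one possible sketch is below. It assumes hypothetical variable names startdate and enddate (the thread names only idp) and assumes both are Stata daily dates. It keeps each woman's most recent treatment episode, so the start date and the last contact come from the same episode.

```stata
* Assumed names, not given in the thread: startdate, enddate (Stata daily dates).
* idp is the patient ID from the question.

* Missing values sort last in Stata, so rule out missing start dates first.
assert !missing(startdate)

* Within each patient, sort by start date and keep the most recent episode.
bysort idp (startdate): keep if _n == _N

* Check that idp now uniquely identifies observations.
isid idp

* Follow-up time in days for the retained episode (missing if enddate is missing).
generate followup_days = enddate - startdate
```

Unlike keep if y == max_y, this keeps exactly one observation per patient even when two episodes share the same start date, although which of the tied observations survives is then arbitrary.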