Hello all,
I'm a new Stata user and new to Statalist; I'm currently using StataMP 14.
I'm doing some internal reporting and analysis for my jurisdiction's contact tracing efforts, and attempting to clump together duplicate observations that were auto-generated from different sources. I feel like I'm almost there but having trouble applying some of the literature on the forum (e.g., Stata Tip 51: Events in intervals).
My ultimate goal is to create a new unique id based off of four criteria (first 3 letters of first name + first three of last + DOB + date of last exposure only if dates fall within one week of each other. I'm stuck at the last criteria and don't know how to apply a loop to my given problem (if that's what's even needed) of including only certain dates within groups (those that fall within one week of each other), and excluding the rest.
I was able to group duplicates and assign them an ID in order to calculate the elapsed time within the group; however, I'm inadvertently excluding some of the dates that I wish to keep. I totaled the elapsed time within groups with the variable "between" and then created a binary variable to flag if groups were within one week of each other (varname: withinoneweek), at which point I would generate a number only for those groups that met the 'within one week' criteria. Once I group dates that fall within one week I can assign them another group number and then concatenate that to the first part of my unique id. As you can see, my method is incomplete and the first four observations were excluded.
I've generated some dummy data that exemplifies the issue at hand. Below, I've included syntax that I've used to get me up to the point where I'm currently stuck. My apologies ahead of time for the confusion as I'm new to Stata.
I'm a new Stata user and new to Statalist; I'm currently using StataMP 14.
I'm doing some internal reporting and analysis for my jurisdiction's contact tracing efforts, and attempting to clump together duplicate observations that were auto-generated from different sources. I feel like I'm almost there but having trouble applying some of the literature on the forum (e.g., Stata Tip 51: Events in intervals).
My ultimate goal is to create a new unique id based off of four criteria (first 3 letters of first name + first three of last + DOB + date of last exposure only if dates fall within one week of each other. I'm stuck at the last criteria and don't know how to apply a loop to my given problem (if that's what's even needed) of including only certain dates within groups (those that fall within one week of each other), and excluding the rest.
I was able to group duplicates and assign them an ID in order to calculate the elapsed time within the group; however, I'm inadvertently excluding some of the dates that I wish to keep. I totaled the elapsed time within groups with the variable "between" and then created a binary variable to flag if groups were within one week of each other (varname: withinoneweek), at which point I would generate a number only for those groups that met the 'within one week' criteria. Once I group dates that fall within one week I can assign them another group number and then concatenate that to the first part of my unique id. As you can see, my method is incomplete and the first four observations were excluded.
I've generated some dummy data that exemplifies the issue at hand. Below, I've included syntax that I've used to get me up to the point where I'm currently stuck. My apologies ahead of time for the confusion as I'm new to Stata.
Code:
clear input str12 testid byte dup_testid float(groupid lastexposure newgroupid between totaldays withinoneweek) "NatGar17866" 5 134 22512 . 0 14 0 "NatGar17866" 5 134 22512 . 0 14 0 "NatGar17866" 5 134 22512 . . 14 0 "NatGar17866" 5 134 22512 . 0 14 0 "NatGar17866" 5 134 22522 . 10 14 0 "NatGar17866" 5 134 22526 . 4 14 0 "JohSmi17836" 4 133 22512 . . 14 0 "JohSmi17836" 4 133 22512 . 0 14 0 "JohSmi17836" 4 133 22512 . 0 14 0 "JohSmi17836" 4 133 22525 . 13 14 0 "JohSmi17836" 4 133 22526 . 1 14 0 "JanDoe17502" 4 173 22531 116 0 0 1 "JanDoe17502" 4 173 22531 116 . 0 1 "JanDoe17502" 4 173 22531 116 0 0 1 "JanDoe17502" 4 173 22531 116 0 0 1 "JanDoe17502" 4 173 22531 116 0 0 1 end format %tdnn/dd/CCYY lastexposure
Code:
egen testid = concat(partfirst partlast dob) duplicates report testid duplicates tag testid, generate(dup_testid) gsort -dup_testid testid lastexposure egen groupid = group(testid) if dup_testid >=1 bysort groupid (lastexposure): gen between = lastexposure - lastexposure[_n-1] if dup_testid >=1 bysort groupid (lastexposure): egen totaldays = total(between) if dup_testid >=1 gen withinoneweek = cond(totaldays >= 0 & totaldays <=7, 1, 0) egen newgroupid = group(groupid) if withinoneweek == 1