Selecting all dates within groups that are within one week of each other and creating a sub-group id

Nathan Garst

Join Date: Dec 2021

Posts: 15
#1

Selecting all dates within groups that are within one week of each other and creating a sub-group id

08 Dec 2021, 14:04

Hello all,

I'm a new Stata user and new to Statalist; I'm currently using StataMP 14.

Purpose: I'm doing some internal reporting and analysis for my jurisdiction's contact tracing efforts, and attempting to clump together duplicate observations that were auto-generated from different sources and then give these dates a sub-group id. I feel like I'm almost there but having trouble applying some of the literature on the forum (e.g., Stata Tip 51: Events in intervals).

Goal: My ultimate goal is to create a new unique id based off of four criteria (first 3 letters of first name + first three of last + DOB + date of last exposure only if dates fall within 7 days of each other. I'm stuck at the last criteria and don't know how to apply a loop to my given problem (if that's what's even needed) of parsing out only certain dates (those that fall within one week of each other), and excluding the rest.

What I've done so far: I was able to group duplicates and assign them an ID in order to calculate the elapsed time within the group; however, I'm inadvertently excluding some of the dates that I wish to keep.

I then totaled the elapsed time within groups with the variable "between" and then created a binary variable to flag if groups were within one week of each other (varname: withinoneweek), at which point I flagged groups that met the 'within one week' criteria. Once I group dates that fall within one week I can assign them another group number and then concatenate that to the first part of my unique id. As you can see, my method is incomplete and the first four observations that have the same last exposure date were excluded.

I've generated some dummy data that exemplifies the issue at hand. Below, I've included syntax that I've used to get me up to the point where I'm currently stuck. My apologies ahead of time for the confusion as I'm new to Stata; any tips or advice would be most helpful and much appreciated.

Code:

clear input str12 testid byte dup_testid float(groupid lastexposure newgroupid between totaldays withinoneweek) "NatGar17866" 5 134 22512 . 0 14 0 "NatGar17866" 5 134 22512 . 0 14 0 "NatGar17866" 5 134 22512 . . 14 0 "NatGar17866" 5 134 22512 . 0 14 0 "NatGar17866" 5 134 22522 . 10 14 0 "NatGar17866" 5 134 22526 . 4 14 0 "JohSmi17836" 4 133 22512 . . 14 0 "JohSmi17836" 4 133 22512 . 0 14 0 "JohSmi17836" 4 133 22512 . 0 14 0 "JohSmi17836" 4 133 22525 . 13 14 0 "JohSmi17836" 4 133 22526 . 1 14 0 "JanDoe17502" 4 173 22531 116 0 0 1 "JanDoe17502" 4 173 22531 116 . 0 1 "JanDoe17502" 4 173 22531 116 0 0 1 "JanDoe17502" 4 173 22531 116 0 0 1 "JanDoe17502" 4 173 22531 116 0 0 1 end format %tdnn/dd/CCYY lastexposure

Code:

egen testid = concat(partfirst partlast dob) duplicates report testid duplicates tag testid, generate(dup_testid) gsort -dup_testid testid lastexposure egen groupid = group(testid) if dup_testid >=1 bysort groupid (lastexposure): gen between = lastexposure - lastexposure[_n-1] if dup_testid >=1 bysort groupid (lastexposure): egen totaldays = total(between) if dup_testid >=1 gen withinoneweek = cond(totaldays >= 0 & totaldays <=7, 1, 0) egen newgroupid = group(groupid) if withinoneweek == 1
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#2

08 Dec 2021, 14:26

I'm not sure I understand what you want to do. But to the extent I understand it, it seems like you are looking to do the impossible because "within 7 days of each other" is not a transitive relationship. So if you have the same name and dob for observation on days 1, 5, and 10, the day 1 and 5 observations are within 7 days of each other, and days 5 and 10 are, as well. But days 1 and 10 are not. So there is no consistent way to group these.

Probably I am missing something here. I think beyond a clearer explanation, it would help if you supplement your example data by showing what you would like the final results to look like--that might be more helpful than words.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#3

08 Dec 2021, 14:39

Although I don't see this as being what you asked for in #1, since I'm not clear on what you want, it dawns on me that you may be seeking to do this, because it is something people often do with this kind of data:

1. For each person (identified by first name, last name, and dob--your variable testid), order the observations chronologically.
2. Assign an id number to the person's first observation.
3. Extend that observation down the list of observations up to a maximum of 7 days.
4. If that exhausts the list of observations, proceed to the next person and repeat steps 2 and 3.
5. If there are more observations for that person beyond 7 days, assign a new id number to the next observation.
6. Extend that observation down the list of observations up to a maximum of 7 days after that.
7. Repeat steps 4-6 until the person's observations are exhausted; then proceed to the next person.

If that's what you want:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str12 testid float lastexposure "NatGar17866" 22512 "NatGar17866" 22512 "NatGar17866" 22512 "NatGar17866" 22512 "NatGar17866" 22522 "NatGar17866" 22526 "JohSmi17836" 22512 "JohSmi17836" 22512 "JohSmi17836" 22512 "JohSmi17836" 22525 "JohSmi17836" 22526 "JanDoe17502" 22531 "JanDoe17502" 22531 "JanDoe17502" 22531 "JanDoe17502" 22531 "JanDoe17502" 22531 end format %tdnn/dd/CCYY lastexposure by testid (lastexposure), sort: gen ref_date = lastexposure[1] if _n == 1 by testid (lastexposure): replace ref_date = /// cond(lastexposure <= ref_date[_n-1] + 7, ref_date[_n-1], lastexposure) if _n > 1 format %tdnn/dd/CCYY ref_date egen new_group_id = group(testid ref_date)

will do that.
1 like
Comment

Nathan Garst

Join Date: Dec 2021
Posts: 15

08 Dec 2021, 15:35

This is harder than I thought, but I think you're almost there with the code you've provided, except that I don't want observations outside the 7-day window to get a sub-group id. Obs that fall outside the 7-day window might eventually be dropped; I'm essentially trying to do some very rough de-duplication based on multiple criteria (first three of first and last + dob + and if lastexposure dates are within 7 days). But before I drop the observations that fall outside the 7 day time frame for last exposure I need to do some other reporting.

I'm not even sure this is possible, but here's an example data showing what I'd the final results to look like:

Code:

clear
input str12 testid float(groupid lastexposure between totaldays withinoneweek newgroupid)
"NatGar17866" 134 22512  .  0 1 114
"NatGar17866" 134 22512  0  0 1 114
"NatGar17866" 134 22512  0  0 1 114
"NatGar17866" 134 22512  0  0 1 114
"NatGar17866" 134 22522 10 14 0   .
"NatGar17866" 134 22526  4 14 0   .
"JohSmi17836" 133 22512  .  0 1 115
"JohSmi17836" 133 22512  0  0 1 115
"JohSmi17836" 133 22512  0  0 1 115
"JohSmi17836" 133 22525 13 14 0   .
"JohSmi17836" 133 22526  1 14 0   .
"JanDoe17502" 173 22530  .  2 1 116
"JanDoe17502" 173 22531  1  2 1 116
"JanDoe17502" 173 22531  0  2 1 116
"JanDoe17502" 173 22531  0  2 1 116
"JanDoe17502" 173 22532  1  2 1 116
"JasSmi18066" 366 22512  .  0 1 117
"JasSmi18066" 366 22512  0  0 1 117
"JasSmi18066" 366 22512  0  0 1 117
"JasSmi18066" 366 22526 14 26 0   .
"JasSmi18066" 366 22538 12 26 0   .
"GreCon7527"  214 22531  .  7 1 118
"GreCon7527"  214 22532  1  7 1 118
"GreCon7527"  214 22534  2  7 1 118
"GreCon7527"  214 22537  3  7 1 118
end
format %tdnn/dd/CCYY lastexposure

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30097

08 Dec 2021, 15:44

Ah, I see.

Code:

*Example generated by -dataex-. For more info, type help dataex
clear
input str12 testid float lastexposure
"NatGar17866" 22512
"NatGar17866" 22512
"NatGar17866" 22512
"NatGar17866" 22512
"NatGar17866" 22522
"NatGar17866" 22526
"JohSmi17836" 22512
"JohSmi17836" 22512
"JohSmi17836" 22512
"JohSmi17836" 22525
"JohSmi17836" 22526
"JanDoe17502" 22530
"JanDoe17502" 22531
"JanDoe17502" 22531
"JanDoe17502" 22531
"JanDoe17502" 22532
"JasSmi18066" 22512
"JasSmi18066" 22512
"JasSmi18066" 22512
"JasSmi18066" 22526
"JasSmi18066" 22538
"GreCon7527"  22531
"GreCon7527"  22532
"GreCon7527"  22534
"GreCon7527"  22537
end
format %tdnn/dd/CCYY lastexposure

by testid (lastexposure), sort: gen new_group_id = 1 if _n == 1
replace new_group_id = sum(new_group_id) if !missing(new_group_id)
by testid (lastexposure): replace new_group_id = new_group_id[_n-1] ///
    if _n > 1 & lastexposure <= lastexposure[1] + 7

By the way this code assumes that "within 7 days" includes day 7. If it should not, then change <= to < in the final command.

Comment

Nathan Garst

Join Date: Dec 2021
Posts: 15

09 Dec 2021, 11:17

Hi Clyde, the code you provided does exactly what it is supposed to (and I will study it later because as the [_n-1] and [1] are still confusing to me).

I get 'option if not allowed' when trying to combine egen, concat(), and if:

Code:

egen finalid = concat(testid new_group_id), punct(" ") if !missing(new_group_id)

Is it possible to flag my new duplicate groups in some way, either by creating a unique id, or by creating some sort of binary variable?

I added single observations to illustrate that I no longer can parse out the new duplicates (based on new_group_id) as there are many single obs in my dataset that have a new_group_id number too.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str12 testid float(lastexposure new_group_id)
"ClySch17866" 22524 1
"GreCon7527"  22531 2
"GreCon7527"  22532 2
"GreCon7527"  22534 2
"GreCon7527"  22537 2
"JanDoe17502" 22530 3
"JanDoe17502" 22531 3
"JanDoe17502" 22531 3
"JanDoe17502" 22531 3
"JanDoe17502" 22532 3
"JasSmi18066" 22512 4
"JasSmi18066" 22512 4
"JasSmi18066" 22512 4
"JasSmi18066" 22526 .
"JasSmi18066" 22538 .
"JohSmi17836" 22512 5
"JohSmi17836" 22512 5
"JohSmi17836" 22512 5
"JohSmi17836" 22525 .
"JohSmi17836" 22526 .
"NatGar17866" 22512 6
"NatGar17866" 22512 6
"NatGar17866" 22512 6
"NatGar17866" 22512 6
"NatGar17866" 22522 .
"NatGar17866" 22526 .
"NicCoc17836" 22525 7
end
format %tdnn/dd/CCYY lastexposure

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#7

09 Dec 2021, 12:29

Code:

egen finalid = concat(testid new_group_id) if !missing(new_group_id), punct(" ")

-if- conditions are not options, so they must precede the comma.

Is it possible to flag my new duplicate groups in some way, either by creating a unique id, or by creating some sort of binary variable?

I added single observations to illustrate that I no longer can parse out the new duplicates (based on new_group_id) as there are many single obs in my dataset that have a new_group_id number too.

Sorry, but I don't understand what you want here. Perhaps, again, showing an example including an illustration of the desired results would help.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#8

09 Dec 2021, 12:38

the [_n-1] and [1] are still confusing to me

In the general context, var[1] refers to the value of variable var in the first observation in the data set. var[_n-1], for any observation, refers to the value of variable var in the immediately preceding the current observation--which means missing value in the case of the first observation. var[_N] refers to the value of variable var in the final observation of the data set.

However, when used in a command having the -by:- prefix this works somewhat differently. A -by- prefix contains a varlist, or sometimes two varlists, the first outside of parentheses and the second contained within parentheses. The variables that are not included in parentheses define the by-group for the command, meaning that the command will be executed separately for each subset of the observations defined by a combination of the values of those variables. In a by-prefixed command, var[1] refers to the value of var in the first observation of the by-group, var[_n-1] refers to the immediately preceding observation of variable var in the bygroup (and, in the first observation of the by-group it is missing value). And var[_N] refers to the value of variable var in the final observation of the by-group.
Comment
Nathan Garst

Join Date: Dec 2021

Posts: 15
#9

13 Dec 2021, 12:20

Ah, excellent! Thank you, your explanation is much much easier to understand.

The code you helped me develop above uses the value var[1] of the by-group as reference, and creates a new_group_id for all observations of the by-group that are within the following seven days.

Originally posted by Clyde Schechter View Post

Ah, I see.

Code:

by testid (lastexposure), sort: gen new_group_id = 1 if _n == 1 replace new_group_id = sum(new_group_id) if !missing(new_group_id) by testid (lastexposure): replace new_group_id = new_group_id[_n-1] /// if _n > 1 & lastexposure <= lastexposure[1] + 7

By the way this code assumes that "within 7 days" includes day 7. If it should not, then change <= to < in the final command.

What if I now needed to consider other pairs of observations of the by-group that still meet the seven-day criteria, but that don't use the value of var[1] of the by-group as a reference point? Let's say I wanted to give a unique group id to all sets of by-group observations that meet the seven-day criteria. In my below example, observations 4 and 5 are currently left out of getting a unique new_group_id, as it's a repeat exposure that got duplicated in another month. However, they're still within seven days of each other and I'd like to give these duplicate pairs a unique id too.

Here's some data that represents the new problem. Let me know if you want me to include an example of how I would want the data to look instead:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str12 testid float(lastexposure new_group_id) "ClySch17866" 22524 1 "GreCon7527" 22531 2 "GreCon7527" 22532 2 "GreCon7527" 22626 . "GreCon7527" 22627 . "JanDoe17502" 22530 3 "JanDoe17502" 22531 3 "JanDoe17502" 22531 3 "JanDoe17502" 22531 3 "JanDoe17502" 22532 3 "JasSmi18066" 22512 4 "JasSmi18066" 22512 4 "JasSmi18066" 22519 4 "JasSmi18066" 22587 . "JasSmi18066" 22588 . "JasSmi18066" 22589 . "JohSmi17836" 22512 5 "JohSmi17836" 22512 5 "JohSmi17836" 22525 . "JohSmi17836" 22526 . "NatGar17866" 22512 6 "NatGar17866" 22512 6 "NicCox17836" 22525 7 end format %tdnn/dd/CCYY lastexposure
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#10

13 Dec 2021, 12:28

Well, I can interpret this in either of two ways:

1. You want to start a new group_id whose reference date is the date in observation 4, and use that group_id for all of that person's observations that fall within 7 days of that. And if after 7 days from the date in observation 4 we find still more exposures for the same person, we group those into 7 day "clumps" in the same manner, until we exhaust all exposures for that person. In that case, that is exactly what the code in #3 does.

2. You are now lurching to a self-contradiction, because "within 7 days" is a non-transitive relation, as noted in #2.
Comment
Nathan Garst

Join Date: Dec 2021

Posts: 15
#11

14 Dec 2021, 16:04

You're right, Clyde, it works! Thanks for the help on this:

Originally posted by Clyde Schechter View Post

1. You want to start a new group_id whose reference date is the date in observation 4, and use that group_id for all of that person's observations that fall within 7 days of that. And if after 7 days from the date in observation 4 we find still more exposures for the same person, we group those into 7 day "clumps" in the same manner, until we exhaust all exposures for that person. In that case, that is exactly what the code in #3 does.
Comment

Announcement