
  • Identifying samples from same individual using matchit, or similar

    I am attempting to analyse a large laboratory dataset containing manually entered identifiers, e.g. name, date of birth, location, etc. Unique identifiers are not always assigned, the same individual may have a few different hospital numbers, e.g. if they were transferred between facilities, and there may be typos, spelling variations, etc.

    Code:
    clear
    input str30 Name1 str30 Name2 str30 DOB str30 uniqueID str30 hospitalnumber
    "John" "Smith" "01031923" "13579" "12346X"
    "Robert" "Brown" "05051940" "." "A3334"
    "Mary" "Smith" "04122000" "." "A5322"
    "Jon" "Smith" "01031923" "13579" "A-23455"
    "Rob" "Brown" "05051940" "." "3334"
    "John" "Smit" "01031923" "." "12346X"
    end
    Is it possible to use the matchit command, or similar, to identify records that are likely to be the same individual? For example, in my data above there would be three people, with John Smith having two different hospital numbers. I have seen examples on Statalist of matchit being used for deduplication, but not for assigning tags to records that belong to the same individual, utilising variables from a number of different fields.

    Suggestions welcome!

  • #2
    Names are poorly coded, which presents problems.

    DOB is going to be the key identifier.

    You might try something like this:

    Code:
    g n1 = substr(Name1,1,2)
    g n2 = substr(Name2,1,2)
    egen id = group(n1 n2 DOB)
    but you'll need to look for oddities.
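    One way to eyeball those oddities is to list only the groups that actually pooled more than one record (a sketch building on the id variable created above; untested):

    Code:
    * sketch: inspect groups that merged more than one record
    bysort id: gen groupsize = _N
    list Name1 Name2 DOB hospitalnumber id if groupsize > 1, sepby(id)

    Anything grouped by a coincidence of initials and DOB should stand out here.
    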



    • #3
      Thanks George Ford, I agree DOB will be valuable here. I had wondered whether Julio Raffo's matchit package might be a way around the potential poor/variable coding of names? But I've not seen it used in quite this way, where data may be missing or poorly coded across a number of variables.

      Perhaps a tiered set of rules might work here:

      1. Group everything with a matching unique identifier
      2. Concatenate name and surname
      3. Group names where matchit scores are above a threshold, requiring at least one other variable to match, and perhaps adjusting the threshold depending on the number of other fields that match
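      Rule 1 might be sketched as follows (untested; assumes uniqueID is the string variable from the example data, with "." standing in for missing):

      Code:
      * sketch of rule 1: records sharing a non-missing uniqueID get one group
      gen long row = _n
      egen long gid = group(uniqueID) if uniqueID != "."
      * records without a uniqueID provisionally keep their own group
      replace gid = 1000000 + row if missing(gid)

      The fuzzy-matching rules would then only need to link the provisional groups rather than raw rows.
      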

      A sensitivity analysis would probably be appropriate - running the analysis with both stringent and loose thresholds - and I can also run some biological plausibility checks, e.g. looking for instances where supposedly the same individual flips from seropositive to seronegative (shouldn't usually happen).

      If anyone has code that does something like this, I would be keen to have a look!

      Tom



      • #4
        Hi Tom Yates , yes, you can use -matchit- within your process. You just need to run -matchit- against the same master file. And, in my opinion, your intuition is correct about merging the fields. If you have other fields (such as addresses or birthdates), you could also merge them or use them after the matchit scores.

        Based on your example above (+ two new lines to test for homonyms):

        Code:
        clear
        input str30 Name1 str30 Name2 str30 DOB str30 uniqueID str30 hospitalnumber
        "John" "Smith" "01031923" "13579" "12346X"
        "Robert" "Brown" "05051940" "." "A3334"
        "Mary" "Smith" "04122000" "." "A5322"
        "Jon" "Smith" "01031923" "13579" "A-23455"
        "Rob" "Brown" "05051940" "." "3334"
        "John" "Smit" "01031923" "." "12346X"
        "John" "Smith" "01031943" "22222" "12346X"
        "Robert" "Brown" "05051965" "." "A3334"
        end
        save yourfile.dta, replace
         
        * 1. Concatenate name and surname (I suggest you start like this)
        use yourfile.dta, clear  
        gen fullname=Name1+" "+Name2+" "+DOB+" "+uniqueID
        egen long id=group(fullname) // ids have to be numeric for matchit
        save yourfile_with_id.dta, replace  
        
        * 2. Group everything with a matching unique identifier
        use yourfile_with_id.dta, clear
        keep id fullname
        * deduplicate your data to avoid unnecessary matches
        gduplicates drop  
        *save your new clean file to match
        save yourfile_dedup.dta, replace  
        
        * 3. Run matchit
        use yourfile_dedup.dta, clear
        ren (id fullname) (id1 fullname1)
        matchit id1 fullname1 using yourfile_dedup.dta, idu(id) txtu(fullname) w(log) over  
        
        * 4. Group names where matchit scores are above a threshold, requiring at least one other variable to match, and perhaps adjusting the threshold depending on the number of other fields that match
        * run this for a manual inspection to establish threshold
        gsort -similscore
        br
        // Drop the pairs you don't want to match (here I use a threshold of .73)
        drop if similscore<.73
        
        * Group names:
        ren (id fullname) (id2 fullname2)
        gen long groupid = _n  
        reshape long id fullname, i(groupid) j(n)
        drop n
        gduplicates drop  
        * ssc install group_id // if not installed (by Robert Picard)  
        group_id groupid , matchby(id)  
        
        * delete duplicates  
        gsort -similscore // I suggest this to keep track of what is the score for each name matching the group
        gduplicates drop groupid id fullname, force  
        save yourfile_matching.dta, replace  
        
        * merge back to your file
        merge 1:1 id using yourfile_with_id.dta

        You will find a similar code in some old slides here (see slide 8 onwards): https://www.stata.com/meeting/switzerland16/slides/raffo-switzerland16.pdf
        Last edited by Julio Raffo; 18 Mar 2024, 03:23. Reason: wrong line breaks when pasting codes (please check if there are two lines of code merged)



        • #5
          On closer inspection, use the DOB with caution; check the case of the second homonym introduced. Not having uniqueIDs and having similar DOBs makes them score .77.
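          One way to act on that caution is to carry DOB as its own variable and require it to agree before accepting a borderline match (a sketch, not tested; assumes DOB1 and DOB2 have been merged back onto the matchit output for the master and using sides respectively):

          Code:
          * sketch: require DOB agreement for borderline similarity scores
          gen byte dob_exact = (DOB1 == DOB2)
          drop if similscore < .9 & !dob_exact
          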



          • #6
            Many thanks Julio Raffo. I will have a play!

            Each line in my dataset represents a different test, so 'duplicates' are repeated tests on the same individual. I am not sure I want to delete these lines, so I will revise that aspect of the code. At the next stage of processing, I may wish, for example, to count only the first positive where several are observed in a short space of time. But the first step will be to group tests by individual.
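            For that "first positive" step, something along these lines might work (a sketch with hypothetical variables testdate and result; id is the individual identifier from the grouping step):

            Code:
            * sketch: flag each individual's earliest positive test
            bysort id result (testdate): gen byte firstpos = (_n == 1) if result == "positive"
            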

            With best wishes,
            Tom

