I am attempting to analyse a large laboratory dataset containing manually entered identifiers (name, date of birth, location, etc.). Unique identifiers are not always assigned, and the same individual may have several different hospital numbers, for example if they were transferred between facilities. There may also be typos and spelling variations.
Is it possible to use the matchit command, or something similar, to identify records that are likely to belong to the same individual? In my example there would be three people, with John Smith (JS) having two different hospital numbers. I have seen examples on Statalist of matchit being used for deduplication, but not for assigning tags to records that belong to the same individual using variables from several different fields.
Suggestions welcome!
Code:
clear
input str30 Name1 str30 Name2 str30 DOB str30 uniqueID str30 hospitalnumber
"John" "Smith" "01031923" "13579" "12346X"
"Robert" "Brown" "05051940" "." "A3334"
"Mary" "Smith" "04122000" "." "A5322"
"Jon" "Smith" "01031923" "13579" "A-23455"
"Rob" "Brown" "05051940" "." "3334"
"John" "Smit" "01031923" "." "12346X"
end