
  • Identifying samples from same individual using matchit, or similar

    I am attempting to analyse a large laboratory dataset containing manually entered identifiers, e.g. name, date of birth, location, etc. Unique identifiers are not always assigned, the same individual may have a few different hospital numbers, e.g. if they were transferred between facilities, and there may be typos, spelling variations, etc.

    Code:
    clear
    input str30 Name1 str30 Name2 str30 DOB str30 uniqueID str30 hospitalnumber
    "John" "Smith" "01031923" "13579" "12346X"
    "Robert" "Brown" "05051940" "." "A3334"
    "Mary" "Smith" "04122000" "." "A5322"
    "Jon" "Smith" "01031923" "13579" "A-23455"
    "Rob" "Brown" "05051940" "." "3334"
    "John" "Smit" "01031923" "." "12346X"
    end
    Is it possible to use the matchit command, or similar, to identify records that are likely to be the same individual? For example, in my data above there would be three people, with John Smith having two different hospital numbers. I have seen examples on Statalist of matchit being used for deduplication, but not for assigning tags to records that belong to the same individual, utilising variables from a number of different fields.

    Suggestions welcome!

  • #2
    Names are poorly coded, which presents problems.

    DOB is going to be the key identifier.

    You might try something like this:

    Code:
    g n1 = substr(Name1,1,2)
    g n2 = substr(Name2,1,2)
    egen id = group(n1 n2 DOB)
    but you'll need to look for oddities.
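    One way to eyeball those oddities is to list only the groups that actually pooled more than one record (a sketch building on the id variable created above; untested):

    Code:
    * sketch: inspect groups that merged more than one record
    bysort id: gen groupsize = _N
    list Name1 Name2 DOB hospitalnumber id if groupsize > 1, sepby(id)

    Anything grouped by a coincidence of initials and DOB should stand out here.
    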



    • #3
      Thanks George Ford, I agree DOB will be valuable here. I had wondered whether Julio Raffo's matchit package might be a way around the potential poor/variable coding of names? But I've not seen it used in quite this way, where data may be missing or poorly coded across a number of variables.

      Perhaps a tiered set of rules might work here:

      1. Group everything with a matching unique identifier
      2. Concatenate name and surname
      3. Group names where matchit scores are above a threshold, requiring at least one other variable to match, and perhaps adjusting the threshold depending on the number of other fields that match
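      Rule 1 might be sketched as follows (untested; assumes uniqueID is the string variable from the example data, with "." standing in for missing):

      Code:
      * sketch of rule 1: records sharing a non-missing uniqueID get one group
      gen long row = _n
      egen long gid = group(uniqueID) if uniqueID != "."
      * records without a uniqueID provisionally keep their own group
      replace gid = 1000000 + row if missing(gid)

      The fuzzy-matching rules would then only need to link the provisional groups rather than raw rows.
      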

      A sensitivity analysis would probably be appropriate - running the analysis with both stringent and loose thresholds - and I can also run some biological plausibility checks, e.g. looking for instances where supposedly the same individual flips from seropositive to seronegative (shouldn't usually happen).

      If anyone has code that does something like this, I would be keen to have a look!

      Tom



      • #4
        Hi Tom Yates , yes, you can use -matchit- within your process. You just need to run -matchit- against the same master file. And, in my opinion, your intuition is correct about merging the fields. If you have other fields (such as addresses or birthdates), you could also merge them or use them after the matchit scores.

        Based on your example above (+ two new lines to test for homonyms):

        Code:
        clear
        input str30 Name1 str30 Name2 str30 DOB str30 uniqueID str30 hospitalnumber
        "John" "Smith" "01031923" "13579" "12346X"
        "Robert" "Brown" "05051940" "." "A3334"
        "Mary" "Smith" "04122000" "." "A5322"
        "Jon" "Smith" "01031923" "13579" "A-23455"
        "Rob" "Brown" "05051940" "." "3334"
        "John" "Smit" "01031923" "." "12346X"
        "John" "Smith" "01031943" "22222" "12346X"
        "Robert" "Brown" "05051965" "." "A3334"
        end
        save yourfile.dta, replace
         
        * 1. Concatenate name and surname (I suggest you start like this)
        use yourfile.dta, clear  
        gen fullname=Name1+" "+Name2+" "+DOB+" "+uniqueID
        egen long id=group(fullname) // ids have to be numeric for matchit
        save yourfile_with_id.dta, replace  
        
        * 2. Group everything with a matching unique identifier
        use yourfile_with_id.dta, clear
        keep id fullname
        * deduplicate your data to avoid unnecessary matches
        gduplicates drop  
        *save your new clean file to match
        save yourfile_dedup.dta, replace  
        
        * 3. Run matchit
        use yourfile_dedup.dta, clear
        ren (id fullname) (id1 fullname1)
        matchit id1 fullname1 using yourfile_dedup.dta, idu(id) txtu(fullname) w(log) over  
        
        * 4. Group names where matchit scores are above a threshold, requiring at least one other variable to match, and perhaps adjusting the threshold depending on the number of other fields that match
        * run this for a manual inspection to establish threshold
        gsort -similscore
        br
        // Drop the pairs you don't want to match (here I use a threshold of .73)
        drop if similscore<.73
        
        * Group names:
        ren (id fullname) (id2 fullname2)
        gen long groupid = _n  
        reshape long id fullname, i(groupid) j(n)
        drop n
        gduplicates drop  
        * ssc install group_id // if not installed (by Robert Picard)  
        group_id groupid , matchby(id)  
        
        * delete duplicates  
        gsort -similscore // I suggest this to keep track of what is the score for each name matching the group
        gduplicates drop groupid id fullname, force  
        save yourfile_matching.dta, replace  
        
        * merge back to your file
        merge 1:1 id using yourfile_with_id.dta

        You will find a similar code in some old slides here (see slide 8 onwards): https://www.stata.com/meeting/switzerland16/slides/raffo-switzerland16.pdf
        Last edited by Julio Raffo; 18 Mar 2024, 03:23. Reason: wrong line breaks when pasting codes (please check if there are two lines of code merged)



        • #5
          On closer inspection, use the DOB with caution; check the case of the second homonym introduced. Not having uniqueIDs and having similar DOBs makes them score .77.
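          One way to act on that caution is to carry DOB as its own variable and require it to agree before accepting a borderline match (a sketch, not tested; assumes DOB1 and DOB2 have been merged back onto the matchit output for the master and using sides respectively):

          Code:
          * sketch: require DOB agreement for borderline similarity scores
          gen byte dob_exact = (DOB1 == DOB2)
          drop if similscore < .9 & !dob_exact
          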



          • #6
            Many thanks Julio Raffo. I will have a play!

            Each line in my dataset represents a different test, so 'duplicates' are repeated tests on the same individual. I am not sure I want to delete these lines, so I will revise that aspect of the code. At the next stage of processing, I may wish, for example, to count only the first positive where several are observed in a short space of time. But the first step will be to group tests by individual.
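            For that "first positive" step, something along these lines might work (a sketch with hypothetical variables testdate and result; id is the individual identifier from the grouping step):

            Code:
            * sketch: flag each individual's earliest positive test
            bysort id result (testdate): gen byte firstpos = (_n == 1) if result == "positive"
            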

            With best wishes,
            Tom

