Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to compare strings as to their similarity in a panel dataset with unique identifier

    Dear all,

    I am working with a matched employer-employee dataset from Brazil and facing a similar issue to the one Danilo Silva presented in a previous post (https://www.statalist.org/forums/for...-same-variable) with a difference: I observe different names spelling for the same ID in the variable 'name'.

    So taking as an example what Danilo showed, my case is as the following with the additional ID denoted by "D":

    id name
    A Jeff Ready
    A Jeffrey Ready
    B John Luther Schneider
    B John Luter Schneider
    C Robert D. King
    C Robert King
    D Maria Santos
    D Claudio Miranda

    In particular, I would like to detect when the different name spellings effectively refer to the same person and when we likely have a mistake, i.e., when the same ID probably refers to two different individuals.

    In this example, the ID == "D" is such a case where we probably are handling with typo in the variable 'ID' as the name spellings are very different (Maria is presumably a woman and Claudio is presumably a man).

    My goal is first to flag and then remove (but after eyeballing to check whether the result is appropriate) observations that have an ID value compatible with the "mistake" described above.

    Can you help me figure out how to proceed?

    Thank you very much!


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str1 id str21 name
    "A" "Jeff Ready"           
    "A" "Jeffrey Ready"        
    "B" "John Luther Schneider"
    "B" "John Luter Schneider" 
    "C" "Robert D. King"       
    "C" "Robert King"          
    "D" "Maria Santos"         
    "D" "Claudio Miranda"   
    end

  • #2
    Nothing along these lines will be foolproof. You will need to basically score the pairs on their degree of dissimilarity and then manually confirm. The -soundex()- function generates Soundex codes, which were specifically developed by the US Census Bureau for use in fuzzy matching of names. Here's one approach:

    Code:
    split name
    replace name3 = name2 if missing(name3)
    replace name2 = "" if name2 == name3
    
    forvalues i = 1/3 {
        gen code`i' = soundex(name`i')
    }
    
    by id, sort: gen mismatches = (code1[1] != code1[2]) + (code3[1] != code3[2]) ///
        + (code2[1] != code2[2] & !missing(code2[1]) & !missing(code2[2]))
    When mismatches == 0 you can be reasonably confident it's the same person. When mismatches >= 2, you can be pretty sure it's two different people. When mismatches == 1, that's a grey area and probably requires manual handling. Note that the code does not count a middle name vs no middle name as a mismatch, but it does count two different middle names (or initials) as a mismatch.

    But again, nothing is guaranteed. And some degree of due diligence even on the mismatches == 0 and mismatches >= 2 cases is probably wise.

    Comment


    • #3
      Dear Clyde Schechter , thank you very much for your reply!

      Yes, I thought about using 'soundex' but for Brazilian names in a few tests I did it was not very appropriate (as you highlighted, it was intended to be used by the US Census Bureau).

      Actually, in the meantime between posting and receiving your reply, I tried reshaping the data and comparing strings using 'strdist' (from SSC), which seems to be a promising way:


      id name1 name2 score_strdist
      A Jeff Ready Jeffrey Ready 3
      B John Luter Schneider John Luther Schneider 1
      C Robert D. King Robert King 3
      D Claudio Miranda Maria Santos 11
      Code:
      *ssc install strdist
      
      gcontract id name
      by id: g dup = cond(_N == 1, 1, _n)
      drop _freq
      reshape wide name, i(id) j(dup)
      
      ustrdist name1 name2, g(score_strdist)
      I think I will start eyeballing the outliers in the variable 'score_strdist'.
      Last edited by Otavio Conceicao; 27 May 2021, 07:44.

      Comment


      • #4
        Hi Otavio,

        Following your example, you can also do this using -matchit- , in two different ways:

        1. Very similar to what you did:

        Code:
        clear
        input str1 id str21 name
        "A" "Jeff Ready"           
        "A" "Jeffrey Ready"        
        "B" "John Luther Schneider"
        "B" "John Luter Schneider"
        "C" "Robert D. King"       
        "C" "Robert King"          
        "D" "Maria Santos"         
        "D" "Claudio Miranda"   
        end
        
        
        bysort id: g dup = cond(_N == 1, 1, _n)
        reshape wide name, i(id) j(dup)
        
        matchit name1 name2
        2. Alternatively (specially if you have more than one duplicate per id), you can do this:

        Code:
        clear
        input str1 id str21 name
        "A" "Jeff Ready"           
        "A" "Jeffrey Ready"        
        "B" "John Luther Schneider"
        "B" "John Luter Schneider"
        "C" "Robert D. King"       
        "C" "Robert King"          
        "D" "Maria Santos"         
        "D" "Claudio Miranda"   
        end
        
        tempfile myusing
        gen n=_n // this is to create an index
        save `myusing'
        ren (n name) (n2 name2)
        joinby id using `myusing'
        drop if n<=n2 // this is to drop the elements below the diagonal to avoid running matchit where it is not needed.
        drop if name == name2 // this is to exclude the perfect matches (but you can keep it)
        matchit name name2
        I hope this helps



        Comment


        • #5
          Thank you very much, Julio Raffo !!

          Your solution helps a lot, especially option #2 as I do have up to 6 duplicates per id.

          Comment

          Working...
          X