  • Help with reclink: perform fuzzy matches of a variable within exact matches of another variable

    I am trying to link records across two datasets using two variables: 'cod', a 6-digit code stored as a string, and 'name', a string variable holding a person's name.

    My idea is to first find the exact matches on 'cod' and then fuzzy-match names within each value of 'cod'.

    My example datasets are copied below. In the master dataset, two observations have cod == "530461": one with name "WAGNER OLIVEIRA" and one with name "VAGNER OLIVEIRA". I want both of them to be matched to the observation in the using dataset with the same cod and name "WAGNER OLIVEIRA", since "VAGNER" is just a tiny variation of the name.

    I have tried different reclink options (orblock, wmatch and wnomatch) without success. It only returns exact cod-name matches; I cannot get it to fuzzy-match those tiny variations in names.

    Here's the reproducible example:

    Code:
    /*
    ssc install reclink
    ssc install dataex
    */
    
    clear
    input byte id_using str6 cod str24 name byte var_using
    1    "530461"    "WAGNER OLIVEIRA"            0
    2    "675232"    "MARIANA COUTINHO"            1
    3    "675232"    "JOANA DA SILVA"            0    
    4    "513372"    "ROMEU DE SOUZA"            0
    5    "808747"    "JULIETA CORREA DOS ANJOS"    1
    6    "650334"    "JULIETA CORREA APARECIDA"    1
    7    "351475"    "ROSANGELA DIRCKSCHNEIDER"    0
    8    "970505"    "TOMIKI SHIOKI"                0
    9    "351475"    "ANA MARIA MELO FRANCO"        0
    10    "773263"    "PROTOGENES HERMENEGILDO"    0
    11    "530461"    "ABADIO DOS SANTOS"            1
    end
    
    sort cod name
    tempfile using
    save `using'
    clear
    
    input byte id_master str6 cod str24 name float var_master
    1    "530461"    "WAGNER OLIVEIRA"            0.900256205
    2    "" ""                                    0.244029951
    3    "675232"    "MARIANA COUTINHO"            0.797757411
    4    "" ""                                    0.20090559
    5    "" ""                                    0.23264436
    6    "530461"    "VAGNER OLIVEIRA"            0.534601937
    7    "675232"    "JOANA DA SILVA"            0.138305611
    8    "513372"    "ROMEU DE SOUZA"            0.605197148
    9    "808747"    "JULIETA CORREA DOS ANJOS"    0.769864143
    10    "650334"    "JULIETA CORREA APARECIDA"    0.115634447
    11    "351475"    "ROSANGELA DIRCKSCHNEIDER"    0.983849107
    12    "" ""                                    0.794636935
    13    "970505"    "TOMIKI SHIOKI"                0.129721648
    14    "351475"    "ANA MARIA MELO FRANCO"        0.253776459
    15    "351475"    "ANA MARIA MELO FRANCO"        0.612023696
    16    "" ""                                    0.649227503
    17    "773263"    "PROTOGENES HERMENEGILDO"    0.154941878
    18    "675232"    "MARIANA S COUTINHO"        0.642434356
    19    "" ""                                    0.566767856
    20    "530461"    "ABADIO DOS SANTOS"            0.279123444
    end
    
    sort cod name
    reclink cod name using `using', ///
            idmaster(id_master) idusing(id_using) ///
            gen(match_score)

  • #2
    Try reclink2 with the options manytoone and npairs(). In Stata, type search reclink2. Read the associated Stata Journal article to learn why and when reclink2 is better than reclink.
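
    A minimal sketch of what such a call might look like, assuming the master data from post #1 are in memory (sorted, and before the original reclink call has been run) and the tempfile `using' still exists. The value npairs(2), the commented-out required(cod) line, and the U prefix on using-side variables are illustrative assumptions, not something confirmed in this thread:

    Code:
    * manytoone lets several master records link to the same using record,
    * e.g. both "WAGNER OLIVEIRA" and "VAGNER OLIVEIRA" to id_using 1
    * npairs(2) keeps up to two candidate using records per master record
    * (the value 2 is an arbitrary choice for illustration)
    reclink2 cod name using `using', ///
            idmaster(id_master) idusing(id_using) ///
            gen(match_score) manytoone npairs(2)
    
    * Assumption: required() from reclink is also accepted by reclink2;
    * if so, adding required(cod) would force cod to agree exactly
    * reclink2 cod name using `using', idmaster(id_master) idusing(id_using) ///
    *         gen(match_score) manytoone npairs(2) required(cod)
    
    * Inspect the candidate pairs; this assumes the default U prefix
    * that reclink gives to using-side copies of the matching variables
    sort id_master
    list id_master id_using name Uname match_score, sepby(id_master)

    Checking the listed pairs by eye before keeping them is a sensible precaution, since npairs() can return more than one candidate per master record.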

    • #3
      Thank you Anders, it works with reclink2 with those options!
