Help with reclink: perform fuzzy matches of a variable within exact matches of another variable

Wagner Oliveira

Join Date: Jan 2021
Posts: 7

23 Aug 2021, 14:01

I am trying to link records across two datasets using two variables: 'cod', a 6-digit code stored as a string, and 'name', a string variable holding a person's name.

My idea is to first match exactly on 'cod' and then match fuzzily on 'name' within each value of 'cod'.

My example datasets are below. In the master dataset, the observation with cod == "530461" and name "WAGNER OLIVEIRA" and the observation with the same cod but name "VAGNER OLIVEIRA" should both be matched to the using-dataset observation with the same cod and name "WAGNER OLIVEIRA", since the second name is only a slight variation of the first.

I have tried different reclink options (orblock, wmatch and wnomatch) without success. reclink returns only the exact cod-name matches; I cannot get it to fuzzy-match these slight name variations.

Here's the reproducible example:

Code:

/*
ssc install reclink
ssc install dataex
*/

clear
input byte id_using str6 cod str24 name byte var_using
1    "530461"    "WAGNER OLIVEIRA"            0
2    "675232"    "MARIANA COUTINHO"            1
3    "675232"    "JOANA DA SILVA"            0    
4    "513372"    "ROMEU DE SOUZA"            0
5    "808747"    "JULIETA CORREA DOS ANJOS"    1
6    "650334"    "JULIETA CORREA APARECIDA"    1
7    "351475"    "ROSANGELA DIRCKSCHNEIDER"    0
8    "970505"    "TOMIKI SHIOKI"                0
9    "351475"    "ANA MARIA MELO FRANCO"        0
10    "773263"    "PROTOGENES HERMENEGILDO"    0
11    "530461"    "ABADIO DOS SANTOS"            1
end

sort cod name
tempfile using
save `using'
clear

input byte id_master str6 cod str24 name float var_master
1    "530461"    "WAGNER OLIVEIRA"            0.900256205
2    "" ""                                    0.244029951
3    "675232"    "MARIANA COUTINHO"            0.797757411
4    "" ""                                    0.20090559
5    "" ""                                    0.23264436
6    "530461"    "VAGNER OLIVEIRA"            0.534601937
7    "675232"    "JOANA DA SILVA"            0.138305611
8    "513372"    "ROMEU DE SOUZA"            0.605197148
9    "808747"    "JULIETA CORREA DOS ANJOS"    0.769864143
10    "650334"    "JULIETA CORREA APARECIDA"    0.115634447
11    "351475"    "ROSANGELA DIRCKSCHNEIDER"    0.983849107
12    "" ""                                    0.794636935
13    "970505"    "TOMIKI SHIOKI"                0.129721648
14    "351475"    "ANA MARIA MELO FRANCO"        0.253776459
15    "351475"    "ANA MARIA MELO FRANCO"        0.612023696
16    "" ""                                    0.649227503
17    "773263"    "PROTOGENES HERMENEGILDO"    0.154941878
18    "675232"    "MARIANA S COUTINHO"        0.642434356
19    "" ""                                    0.566767856
20    "530461"    "ABADIO DOS SANTOS"            0.279123444
end

sort cod name
reclink cod name using `using', ///
        idmaster(id_master) idusing(id_using) ///
        gen(match_score)

Anders Alexandersson

Join Date: Apr 2014

Posts: 203
#2

27 Aug 2021, 07:04

Try reclink2 with the options manytoone and npairs(). In Stata, type search reclink2. Read the associated Stata Journal article to learn why and when reclink2 is better than reclink.
Wagner Oliveira

Join Date: Jan 2021

Posts: 7
#3

10 Feb 2022, 11:35

Thank you Anders, it works with reclink2 with those options!
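
The exact command was not posted in the thread. As a minimal sketch, assuming the master data from #1 are in memory and the tempfile `using' has been saved as in #1, a call along these lines would use the options Anders suggested. The required(cod) option and the value npairs(2) are assumptions, not taken from the thread; check help reclink2 for the exact behavior in the installed version.

```stata
* Sketch only: assumes reclink2 is installed (see: search reclink2)
* and that the master data are in memory with `using' saved as in #1.
reclink2 cod name using `using', ///
        idmaster(id_master) idusing(id_using) ///
        gen(match_score) ///
        required(cod)   /// assumption: force exact agreement on cod
        manytoone       /// let several master records match one using record
        npairs(2)       //  assumption: keep up to 2 candidate pairs per master record
```

With manytoone, both "WAGNER OLIVEIRA" and "VAGNER OLIVEIRA" in the master should be able to pair with the single "WAGNER OLIVEIRA" record in the using data. Inspect match_score afterwards to decide which candidate pairs to keep.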