Hi,
I have one dataset that contains the full names (two given names, two surnames) of approximately 200,000 people. This string information is of low quality: many of the names are misspelled or incomplete.
I am trying to merge this dataset with another one that contains the correct full names of 13,000,000 individuals (including the 200,000 from the first dataset). Because the names are misspelled or incomplete, an exact merge is not a good solution, so I am attempting a fuzzy merge.
I have tried the reclink command, but it takes forever to run and the results do not make much sense (score = 1.0 for the two string variables selected, even though their actual values are completely different).
I would like to do the fuzzy merge using the Jaro-Winkler distance measure, but I am struggling with this. Ideally, I would get more than one merge candidate for each of the 200,000 observations.
Just in case, I am currently using Stata 14.0.
Thanks in advance,
Isidora.
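
To make the request above concrete, here is a minimal sketch of what is meant by Jaro-Winkler scoring with several merge candidates per name. It is written in Python rather than Stata, purely as an illustration and not as a tested solution for Stata 14. The function names (`jaro`, `jaro_winkler`, `top_candidates`), the first-letter blocking rule, the default threshold, and the sample names are all assumptions made up for this example. Blocking on the first letter is only one way to avoid comparing every one of the 200,000 names against all 13,000,000, and it would miss names whose first letter is itself misspelled.

```python
import heapq
from collections import defaultdict


def jaro(s1, s2):
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def build_blocks(reference_names):
    """Group reference names by first letter (an assumed blocking rule)."""
    blocks = defaultdict(list)
    for name in reference_names:
        if name:
            blocks[name[0]].append(name)
    return blocks


def top_candidates(query, blocks, k=3, min_score=0.85):
    """Return up to k (name, score) pairs from the query's block, best first."""
    if not query:
        return []
    scored = ((name, jaro_winkler(query, name)) for name in blocks.get(query[0], []))
    kept = [pair for pair in scored if pair[1] >= min_score]
    return heapq.nlargest(k, kept, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Made-up names standing in for the clean reference dataset.
    reference = [
        "MARIA JOSE GONZALEZ PEREZ",
        "MARIO JOSE GONZALES PEREZ",
        "PEDRO ANTONIO ROJAS SOTO",
    ]
    blocks = build_blocks(reference)
    for name, score in top_candidates("MARIA JOSE GONZALES PERES", blocks):
        print(f"{score:.3f}  {name}")
```

The output I am after has this shape: for each misspelled name, a short ranked list of candidate names with their similarity scores, rather than a single forced match.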