Fuzzy merge between two datasets, using jarowinkler distance

Isidora Vergara

Join Date: May 2019

Posts: 18
#1

30 Oct 2019, 15:18

Hi,

I have one dataset that contains the full names (two given names, two surnames) of approximately 200,000 people. This string information is of low quality: many of the names are misspelled or incomplete.

What I am trying to do is merge this dataset with another one that contains the correct full names of 13,000,000 individuals (including the 200,000 from the first dataset). Since the names are misspelled or incomplete, an exact merge is not a good solution, so I am going for a fuzzy merge.

I have been trying the reclink command, but it takes forever to run and the results do not make much sense (a score of 1.0 for pairs whose values in the two selected string variables are completely different).

I would like to do the fuzzy merge using the Jaro-Winkler distance measure, but I am struggling with this. Ideally, I would be able to get more than one merge candidate for each of the 200,000 observations.

Just in case, I am currently using Stata 14.0.

Thanks in advance,
Isidora.
Mike Lacy

Join Date: Apr 2014

Posts: 2396
#2

30 Oct 2019, 16:07

Another popular program for such purposes is -ssc describe matchit- . I have no experience with -matchit-, and I see that it does not list Jaro-Winkler as a distance measure, but perhaps it might be of use.
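
For reference, the Jaro-Winkler measure asked about above can be sketched in a few lines. This is only an illustration in Python, not Stata code and not what -reclink- or -matchit- do internally. The helper `top_candidates`, its `k` and `threshold` parameters, and the example names are all hypothetical; the prefix weight 0.1 and the 4-character prefix cap are the commonly used defaults.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only if they are within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix (up to max_prefix chars)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def top_candidates(name, reference, k=3, threshold=0.85):
    """Hypothetical helper: the k best-scoring reference names above threshold."""
    scored = [(jaro_winkler(name, ref), ref) for ref in reference]
    scored = [pair for pair in scored if pair[0] >= threshold]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    reference = ["JUAN PEREZ", "JUANA PEREZ", "MARIA LOPEZ"]  # made-up names
    print(top_candidates("JUAN PERES", reference))
```

Scoring every pair between 200,000 and 13,000,000 names this way would mean about 2.6 trillion comparisons, so the sketch only shows how the score and the "several candidates per name" idea work, not a practical way to run the full merge.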
David Benson

Join Date: Oct 2018

Posts: 489
#3

31 Oct 2019, 15:35

Isidora,

You might take a look at the Statalist posts here, here and here (the last one mentions strutil, a Stata package on GitHub that allows you to use Jaccard similarity).

Some non-Stata tools you might take a look at:
1) Google Refine (now called OpenRefine). ProPublica has a tutorial about it here.

2) Stanford's DataWrangler software (now called Trifacta DataWrangler). Apparently it was a research project that they have now tried to commercialize (but you can do a free 14-day trial).
Original (with a great video) is here. Commercial version is here.
Isidora Vergara

Join Date: May 2019

Posts: 18
#4

07 Nov 2019, 07:11

Thank you both. I was able to do it using matchit.