Fuzzy identification using jarowinkler distance

Isidora Vergara

Join Date: May 2019
Posts: 18

07 Nov 2019, 08:27

Hello,

I have a dataset with 2,800,000 observations and 10 variables, containing the first and last names of a group of people. There is more than one row per person, but most of the names are misspelled, so it is difficult to determine which rows actually refer to the same person.

What I would like to do is create a "family" variable that identifies the rows referring to the same person. Since there are no exact matches, I am trying to use the Jaro-Winkler distance measure: if the similarity between two observations is greater than 75%, the family variable should take the same value for both.

This is an abbreviated version of the dataset. I want to compare the full names, because some people have their last name in the name variable and vice versa.

id	name	last_name	full_name	family
1	JACK	SMITH	JACKSMITH	1
2	JAACK	SMOTH	JAACKSMOTH	1
3	JAC	S.	JACS.	1
4	HARRY	BAKER	HARRYBAKER	2
5	RYAN	MILLER	RYANMILLER	3
6	MILLER	RIAN	MILLERRIAN	3
7	OLIVER	PARKER	OLIVERPARKER	4
8	OLIVERR	PARKER	OLIVERRPARKER	4
9	OLLI	ER	OLLIER	¿4-5?
10	HOLLIE	TURNER	HOLLIETURNER	5
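
To make the intended rule concrete, here is a minimal sketch in Python (not Stata) of the usual Jaro-Winkler similarity and of one possible way to turn pairwise similarities into a family number. The function names jaro, jaro_winkler and assign_families are hypothetical, the prefix weight of 0.1 and maximum prefix length of 4 are the conventional Winkler defaults, and linking rows transitively (if A matches B and B matches C, all three share a family) is an assumption about how the 75% rule should be applied.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix of up to max_prefix characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def assign_families(names, threshold=0.75):
    """Give the same family number to names linked by similarity > threshold.

    Assumption: links are transitive (single linkage), implemented with union-find.
    """
    parent = list(range(len(names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if jaro_winkler(names[i], names[j]) > threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)

    # Number families 1, 2, ... in order of first appearance.
    labels = {}
    families = []
    for i in range(len(names)):
        root = find(i)
        if root not in labels:
            labels[root] = len(labels) + 1
        families.append(labels[root])
    return families
```

For example, jaro_winkler("JACKSMITH", "JAACKSMOTH") is about 0.92, so rows 1 and 2 would end up in the same family. Note that the double loop compares every pair of rows, so the work grows with the square of the number of observations; it is only meant to illustrate the rule on a small extract like the one above.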

Just in case, I am currently using Stata 14.

Thanks in advance,
Isidora.


Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#2

07 Nov 2019, 13:38

I am not familiar with the Jaro-Winkler distance measure myself, but there is a community-contributed program jarowinkler available from SSC. I have not used it, so I cannot give you a positive or negative recommendation, but it sounds like something you should try.

If that does not prove satisfactory, and if you are not married to the Jaro-Winkler metric, Julio Raffo's -matchit- program, also available from SSC, offers several different metrics and, in my experience, works well. For English names, consider using the -soundex- method: it was designed for that purpose.
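
To give an idea of what a Soundex-style comparison does, here is a minimal sketch of the standard American Soundex coding in Python (not Stata). The function name soundex and the CODES table are written for this illustration, and the result is not guaranteed to agree with -matchit- or with Stata's own soundex() function in every edge case.

```python
# Consonant groups of American Soundex; vowels, H, W and Y carry no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the 4-character Soundex code of name ('' if it has no letters)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    out = first
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate two letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            out += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (out + "000")[:4]
```

With this coding, SMITH and SMOTH both become S530, so a misspelled vowel does not prevent a match, while abbreviations such as "S." would still not be matched.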