  • Fuzzy identification using jarowinkler distance

    Hello,

    I have a dataset with 2,800,000 observations and 10 variables, containing the first and last names of a group of people. There is more than one row per person, but most of the names are misspelled, so it is difficult to determine which rows actually refer to the same person.

    What I would like to do is create a "family" variable that identifies the rows referring to the same person. Since there are few exact matches, I am trying to use the Jaro-Winkler similarity measure: if two observations are more than 75% similar, the family variable should take the same value for both.
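
    To make the measure concrete, here is a minimal sketch of the standard Jaro-Winkler similarity, written in Python rather than Stata purely for illustration. The function names and the defaults (prefix weight 0.1, prefix capped at 4 characters) are the commonly quoted ones and are assumptions here, not taken from any particular Stata program.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match only if equal and no further apart than this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2  # transpositions

    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix (assumed defaults)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


if __name__ == "__main__":
    # Two rows from the example data that should end up in the same family.
    print(jaro_winkler("JACKSMITH", "JAACKSMOTH"))
```

    Because the measure is sensitive to character position, names entered in swapped order (last name first) may score lower than the same names entered consistently.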

    Below is an abbreviated version of the dataset. I want to compare the full name, because some people have their last name recorded in the name variable and vice versa.

    id  name     last_name  full_name      family
    1   JACK     SMITH      JACKSMITH      1
    2   JAACK    SMOTH      JAACKSMOTH     1
    3   JAC      S.         JACS.          1
    4   HARRY    BAKER      HARRYBAKER     2
    5   RYAN     MILLER     RYANMILLER     3
    6   MILLER   RIAN       MILLERRIAN     3
    7   OLIVER   PARKER     OLIVERPARKER   4
    8   OLIVERR  PARKER     OLIVERRPARKER  4
    9   OLLI     ER         OLLIER         ¿4-5?
    10  HOLLIE   TURNER     HOLLIETURNER   5
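
    The step from pairwise similarities to a family identifier can be sketched as follows, again in Python for illustration only. The helper names (`similarity`, `assign_families`) are hypothetical, and the default similarity is `difflib.SequenceMatcher` from the standard library, used as a stand-in rather than Jaro-Winkler; any function returning a score in [0, 1] can be passed instead.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    # Stand-in measure from the standard library, NOT Jaro-Winkler.
    return SequenceMatcher(None, a, b).ratio()


def assign_families(names, threshold=0.75, sim=similarity):
    """Give the same family number to names linked by sim > threshold."""
    parent = list(range(len(names)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair and merge the groups of sufficiently similar names.
    for i, j in combinations(range(len(names)), 2):
        if sim(names[i], names[j]) > threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[max(ri, rj)] = min(ri, rj)

    # Relabel group roots as 1, 2, 3, ... in order of first appearance.
    labels = {}
    families = []
    for i in range(len(names)):
        root = find(i)
        labels.setdefault(root, len(labels) + 1)
        families.append(labels[root])
    return families


if __name__ == "__main__":
    full_names = ["JACKSMITH", "JAACKSMOTH", "HARRYBAKER", "OLIVERPARKER"]
    print(assign_families(full_names))
```

    Two caveats with this approach. Groups are linked transitively, so if A is similar to B and B to C, all three share a family even when A and C fall below the threshold, which is exactly the ambiguity in row 9. Comparing every pair also requires n(n-1)/2 comparisons, roughly 3.9 trillion for 2,800,000 rows, so in practice comparisons would probably have to be restricted to candidate blocks (for example, rows sharing a first letter).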

    In case it matters, I am using Stata 14.

    Thanks in advance,
    Isidora.

  • #2
    I am not familiar with the Jaro-Winkler distance measure myself, but there is a community-contributed program, -jarowinkler-, available from SSC. I have not used it, so I cannot recommend it one way or the other, but it sounds like something you should try.

    If that does not prove satisfactory, and if you are not married to the Jaro-Winkler metric, Julio Raffo's -matchit- program, also available from SSC, offers several different metrics and, in my experience, works well. For English names, consider using the -soundex- method: it was designed for that purpose.
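
    To illustrate why a phonetic code helps with misspelled English names, here is a minimal Python sketch of the commonly described American Soundex rules. It is an assumption that these rules are close to what any given Stata routine implements; individual implementations differ in details such as the handling of H and W, and the function name below is only for this example.

```python
# Consonant groups and their Soundex digits; vowels, H, W and Y have no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Four-character Soundex code: first letter plus three digits."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    result = first
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]  # pad with zeros, truncate to 4 characters


if __name__ == "__main__":
    for name in ["SMITH", "SMOTH", "RYAN", "RIAN"]:
        print(name, soundex(name))
```

    Under these rules SMITH and SMOTH receive the same code, as do RYAN and RIAN, so exact matching on the codes would link those misspellings. The code is computed per word, so it would be applied to name and last_name separately rather than to the concatenated full_name.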
