Hello,
I have a dataset with 2,800,000 observations and 10 variables, containing the first name and last name of a group of people. There is more than one row per person, but most of the names are misspelled, so it is difficult to determine which rows actually refer to the same person.
What I would like to do is create a "family" variable that identifies the rows referring to the same person. Since there are no exact matches, I am trying to use the Jaro-Winkler similarity measure: if two observations are more than 75% similar, the family variable should take the same value for both.
This is an abbreviated version of the dataset. I want to compare the full names, because some people have their last name in the name variable and vice versa.
id | name | last_name | full_name | family |
1 | JACK | SMITH | JACKSMITH | 1 |
2 | JAACK | SMOTH | JAACKSMOTH | 1 |
3 | JAC | S. | JACS. | 1 |
4 | HARRY | BAKER | HARRYBAKER | 2 |
5 | RYAN | MILLER | RYANMILLER | 3 |
6 | MILLER | RIAN | MILLERRIAN | 3 |
7 | OLIVER | PARKER | OLIVERPARKER | 4 |
8 | OLIVERR | PARKER | OLIVERRPARKER | 4 |
9 | OLLI | ER | OLLIER | ¿4-5? |
10 | HOLLIE | TURNER | HOLLIETURNER | 5 |
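To make the rule concrete, here is a minimal sketch of the grouping I have in mind. It is written in Python rather than Stata, purely for illustration, and everything in it is an assumption on my part rather than an existing command: the function names `jaro`, `jaro_winkler` and `assign_families`, the usual prefix weight of 0.1, and the choice to link rows transitively (if A matches B and B matches C, all three share a family). It compares every pair of rows, so it would not scale to 2.8 million observations as written.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching if they are within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def assign_families(names, threshold=0.75):
    """Give the same family number to rows linked by similarity > threshold."""
    parent = list(range(len(names)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Assumption: all-pairs comparison, fine for a toy sample only.
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if jaro_winkler(names[i], names[j]) > threshold:
                parent[find(j)] = find(i)

    # Number the families 1, 2, 3, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels) + 1) for i in range(len(names))]


# full_name column from the abbreviated dataset above
full_names = [
    "JACKSMITH", "JAACKSMOTH", "JACS.", "HARRYBAKER", "RYANMILLER",
    "MILLERRIAN", "OLIVERPARKER", "OLIVERRPARKER", "OLLIER", "HOLLIETURNER",
]
families = assign_families(full_names)
for row_id, (name, fam) in enumerate(zip(full_names, families), start=1):
    print(row_id, name, fam)
```

With this rule, rows 1 to 3 and rows 7 and 8 end up together, which is what I want. I am less sure it handles cases like rows 5 and 6, where the name and last name are swapped, because Jaro-Winkler gives extra weight to a shared prefix.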
Just in case, I am currently using Stata 14.
Thanks in advance,
Isidora.