Create identifier for a variable that is in two data sets

Lisa Alejandra Kobrich

Join Date: Dec 2023

Posts: 17
#1

Create identifier for a variable that is in two data sets

20 Jan 2024, 05:28

Hello!

I have to merge two data sets. They both have the 'name' variable in common. But I need an identifier that will facilitate the matching.

I have two issues with this:
1. The names across data sets are not consistent. In one data set, the name variable might include a middle name and in the other one, it might or not include it. I hope the table below helps illustrate the inconsistency I find between data sets.
data set 1 data set 2

Name Name

AA BB CC AA CC

DD EE DD EE FF

*I am thinking that maybe I could extract each name and create three or more variables for each name. And then create an identifier that matches at least two names together (for both data sets)

2. Even if I find a solution to first problem. How can I ensure that the identifiers I create will be consistent across data sets?

Thank you in advance for any tipp on how to solve any of my issues.

Last edited by Lisa Alejandra Kobrich; 20 Jan 2024, 05:59.
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 35219
#2

20 Jan 2024, 05:57

You can remove middle names like this

Code:

clonevar name_original = name replace name = word(name, 1) + " " + word(name, -1)

which should work for anything matching the pattern you show (assuming that DD EE FF really means DD FF EE).

My guess is that you may have other problems too. A variant of Murphy's Law is that large datasets based on real people are usually messy.

There is no simple good news here. You may need to iterate around

1. Try a merge with your revised name variables.

2. See which values don't match.

3. Realise that your names need more surgery and revise your name variable.

Never lose the original name data.

The more detail you lose, the more likely false matches will be. Clearly, but crucially, "Lisa Kobrich" (say) could match several people with different middle names, or none.
Comment
Lisa Alejandra Kobrich

Join Date: Dec 2023

Posts: 17
#3

20 Jan 2024, 06:48

Hi Nick,

Thank you for your quick answer!

Actually, I do not know whether the name is actually DD EE FF or DD FF EE. I am not familiar with those names so I do not know whether they have several middle names or family names.

I just noticed that there was one observation that had as name DD EE in one data set and DD EE FF in the other data set. It could be that they have two family names and in one data set, they included only one, and in the other both. That is why I thought about extracting the first, second (if there were more than 2 words), and last word of each name. And then create an identifier that merges one data set with the other if at least two words match.

Not sure if this would be the best solution or if I am just complicating myself but that is what I thought so far.
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 9958

20 Jan 2024, 13:22

You can try fuzzy matching. However, you need to examine the output carefully and also come up with an appropriate cutoff. Below, I use matchit from SSC. Alternatively, after crossing the datasets, you can break up the names and compare combinations of pairs.

Code:

clear
input str29(Name2)    
"AA CC"
"DD EE FF"
end
tempfile dataset2
save `dataset2'

clear
input str29(Name1)    
"AA BB CC"    
"DD EE"
end

cross using `dataset2'
matchit Name1 Name2
list

keep if similscore>.7    
list

Res.:

Code:

. matchit Name1 Name2
Matching columns Name1 and Name2
Similarity function: bigram
0%
20%
60%
80%
100%
Done!

. 
. list

     +---------------------------------+
     |    Name1      Name2   similsc~e |
     |---------------------------------|
  1. | AA BB CC      AA CC   .75592895 |
  2. |    DD EE      AA CC           0 |
  3. | AA BB CC   DD EE FF           0 |
  4. |    DD EE   DD EE FF   .75592895 |
     +---------------------------------+

. 
. 
. 
. keep if similscore>.7    
(2 observations deleted)

. 
. list

     +---------------------------------+
     |    Name1      Name2   similsc~e |
     |---------------------------------|
  1. | AA BB CC      AA CC   .75592895 |
  2. |    DD EE   DD EE FF   .75592895 |
     +---------------------------------+

Comment

Lisa Alejandra Kobrich

Join Date: Dec 2023

Posts: 17
#5

22 Jan 2024, 05:19

Thank you Andrew! This helps a lot!!
Comment

data set 1	data set 2
Name	Name
AA BB CC	AA CC
DD EE	DD EE FF

Announcement

Create identifier for a variable that is in two data sets

Comment

Comment

Comment

Comment