How to compare strings as to their similarity in a panel dataset with unique identifier

Otavio Conceicao

Join Date: Feb 2017

Posts: 65
#1

How to compare strings as to their similarity in a panel dataset with unique identifier

26 May 2021, 17:06

Dear all,

I am working with a matched employer-employee dataset from Brazil and facing a similar issue to the one Danilo Silva presented in a previous post (https://www.statalist.org/forums/for...-same-variable) with a difference: I observe different names spelling for the same ID in the variable 'name'.

So taking as an example what Danilo showed, my case is as the following with the additional ID denoted by "D":

id name

A Jeff Ready

A Jeffrey Ready

B John Luther Schneider

B John Luter Schneider

C Robert D. King

C Robert King

D Maria Santos

D Claudio Miranda

In particular, I would like to detect when the different name spellings effectively refer to the same person and when we likely have a mistake, i.e., when the same ID probably refers to two different individuals.

In this example, the ID == "D" is such a case where we probably are handling with typo in the variable 'ID' as the name spellings are very different (Maria is presumably a woman and Claudio is presumably a man).

My goal is first to flag and then remove (but after eyeballing to check whether the result is appropriate) observations that have an ID value compatible with the "mistake" described above.

Can you help me figure out how to proceed?

Thank you very much!

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str1 id str21 name "A" "Jeff Ready" "A" "Jeffrey Ready" "B" "John Luther Schneider" "B" "John Luter Schneider" "C" "Robert D. King" "C" "Robert King" "D" "Maria Santos" "D" "Claudio Miranda" end
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 29942
#2

26 May 2021, 22:54

Nothing along these lines will be foolproof. You will need to basically score the pairs on their degree of dissimilarity and then manually confirm. The -soundex()- function generates Soundex codes, which were specifically developed by the US Census Bureau for use in fuzzy matching of names. Here's one approach:

Code:

split name replace name3 = name2 if missing(name3) replace name2 = "" if name2 == name3 forvalues i = 1/3 { gen code`i' = soundex(name`i') } by id, sort: gen mismatches = (code1[1] != code1[2]) + (code3[1] != code3[2]) /// + (code2[1] != code2[2] & !missing(code2[1]) & !missing(code2[2]))

When mismatches == 0 you can be reasonably confident it's the same person. When mismatches >= 2, you can be pretty sure it's two different people. When mismatches == 1, that's a grey area and probably requires manual handling. Note that the code does not count a middle name vs no middle name as a mismatch, but it does count two different middle names (or initials) as a mismatch.

But again, nothing is guaranteed. And some degree of due diligence even on the mismatches == 0 and mismatches >= 2 cases is probably wise.
Comment
Otavio Conceicao

Join Date: Feb 2017

Posts: 65
#3

27 May 2021, 07:36

Dear Clyde Schechter , thank you very much for your reply!

Yes, I thought about using 'soundex' but for Brazilian names in a few tests I did it was not very appropriate (as you highlighted, it was intended to be used by the US Census Bureau).

Actually, in the meantime between posting and receiving your reply, I tried reshaping the data and comparing strings using 'strdist' (from SSC), which seems to be a promising way:

id name1 name2 score_strdist

A Jeff Ready Jeffrey Ready 3

B John Luter Schneider John Luther Schneider 1

C Robert D. King Robert King 3

D Claudio Miranda Maria Santos 11

Code:

*ssc install strdist gcontract id name by id: g dup = cond(_N == 1, 1, _n) drop _freq reshape wide name, i(id) j(dup) ustrdist name1 name2, g(score_strdist)

I think I will start eyeballing the outliers in the variable 'score_strdist'.

Last edited by Otavio Conceicao; 27 May 2021, 07:44.
Comment

Julio Raffo

Join Date: May 2014
Posts: 132

27 May 2021, 08:47

Hi Otavio,

Following your example, you can also do this using -matchit- , in two different ways:

1. Very similar to what you did:

Code:

clear
input str1 id str21 name
"A" "Jeff Ready"           
"A" "Jeffrey Ready"        
"B" "John Luther Schneider"
"B" "John Luter Schneider"
"C" "Robert D. King"       
"C" "Robert King"          
"D" "Maria Santos"         
"D" "Claudio Miranda"   
end


bysort id: g dup = cond(_N == 1, 1, _n)
reshape wide name, i(id) j(dup)

matchit name1 name2

2. Alternatively (specially if you have more than one duplicate per id), you can do this:

Code:

clear
input str1 id str21 name
"A" "Jeff Ready"           
"A" "Jeffrey Ready"        
"B" "John Luther Schneider"
"B" "John Luter Schneider"
"C" "Robert D. King"       
"C" "Robert King"          
"D" "Maria Santos"         
"D" "Claudio Miranda"   
end

tempfile myusing
gen n=_n // this is to create an index
save `myusing'
ren (n name) (n2 name2)
joinby id using `myusing'
drop if n<=n2 // this is to drop the elements below the diagonal to avoid running matchit where it is not needed.
drop if name == name2 // this is to exclude the perfect matches (but you can keep it)
matchit name name2

I hope this helps

Comment

Otavio Conceicao

Join Date: Feb 2017

Posts: 65
#5

27 May 2021, 08:55

Thank you very much, Julio Raffo !!

Your solution helps a lot, especially option #2 as I do have up to 6 duplicates per id.
1 like
Comment

id	name
A	Jeff Ready
A	Jeffrey Ready
B	John Luther Schneider
B	John Luter Schneider
C	Robert D. King
C	Robert King
D	Maria Santos
D	Claudio Miranda

id	name1	name2	score_strdist
A	Jeff Ready	Jeffrey Ready	3
B	John Luter Schneider	John Luther Schneider	1
C	Robert D. King	Robert King	3
D	Claudio Miranda	Maria Santos	11

Announcement

How to compare strings as to their similarity in a panel dataset with unique identifier

Comment

Comment

Comment

Comment