Dear all,
I am working with a matched employer-employee dataset from Brazil and facing an issue similar to the one Danilo Silva presented in a previous post (https://www.statalist.org/forums/for...-same-variable), with one difference: I observe different spellings of the name for the same ID in the variable 'name'.
Taking Danilo's example as a starting point, my case looks like the following, with an additional ID denoted by "D":
id | name |
A | Jeff Ready |
A | Jeffrey Ready |
B | John Luther Schneider |
B | John Luter Schneider |
C | Robert D. King |
C | Robert King |
D | Maria Santos |
D | Claudio Miranda |
In particular, I would like to detect when the different spellings actually refer to the same person and when we likely have a mistake, i.e., when the same ID probably refers to two different individuals.
In this example, ID == "D" is such a case: we are probably dealing with a typo in the variable 'id', since the names are very different (Maria is presumably a woman and Claudio is presumably a man).
My goal is first to flag and then, after eyeballing the flagged cases to check that the result is appropriate, remove the observations whose ID value is consistent with the "mistake" described above.
Can you help me figure out how to proceed?
Thank you very much!
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str1 id str21 name
"A" "Jeff Ready"
"A" "Jeffrey Ready"
"B" "John Luther Schneider"
"B" "John Luter Schneider"
"C" "Robert D. King"
"C" "Robert King"
"D" "Maria Santos"
"D" "Claudio Miranda"
end
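To make the flagging step concrete, here is a rough first-pass sketch using only built-in Stata functions. It assumes that the last word of 'name' is the surname and that a true spelling variant keeps the same surname. That is a simplifying assumption, not a general rule: it would miss typos inside the surname and would wrongly flag people who changed surnames. The variable names surname and flag are made up for illustration.

```stata
* Assumes the -dataex- example above is in memory (variables id and name).
* Heuristic only: treat the last word of name as the surname.
gen surname = lower(word(name, -1))

* Within each id, sort by surname and compare the first and last values;
* they differ exactly when the id carries more than one distinct surname.
bysort id (surname): gen byte flag = surname[1] != surname[_N]

* Eyeball the flagged ids before dropping anything.
list id name if flag, sepby(id)
* drop if flag
```

On the example data this should flag only id "D", since "A", "B" and "C" each share a single surname. For real data, a string-distance measure on the full name would likely be more robust than an exact surname match.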