I have hundreds of names in my dataset where the name is repeated within the same value, as shown below:
name
Alan B. MenkesAlan B. Menkes
Albert Hill IV Albert Hill IV
Alexander MichaelAlexander Michael
Alexis Borisy Alexis Borisy
Ali Bouzarif Ali Bouzarif
Ali Satvat Ali Satvat
Allen I. QuestromAllen I. Questrom
Allen M. Gibson Allen M. Gibson
Amara Suebsaeng Amara Suebsaeng
Ambyr O DonnellAmbyr O Donnell
Amir Moftakhar Amir Moftakhar
Amortisation Amortisation
Anand GalaAnand Gala
Andr Pienaar Andr Pienaar
Andrea L. Saia Andrea L. Saia
Andrew C. PearsonAndrew C. Pearson
Some names have a space before the repetition; others do not.
I used the code below to remove the repeated names, but it does not run on a Mac running Stata 16 or on a Windows PC running Stata 14. It runs successfully on my Windows PC with Stata 18.
What could be the issue? Why does the regex return the error "regexp: nested *?+" when I run it on those machines with those Stata versions?
Kindly help if you have any ideas.
gen name_v2 = name
gen same_word_rep = 0 // To flag repeated words that may or may not be names like "Tian Tian" "Chow Chow"
replace same_word_rep = 1 if regexm(name, "^(\b\w+\b)\s*\1$")
gen name_rep = 0
replace name_rep = 1 if regexm(name, "^(.*?)(?:\s?\1)+$")
replace name_v2 = regexs(1) if regexm(name, "^(.*?)(?:\s?\1)+$") & same_word_rep == 0
br name name_v2 if name_rep == 1 // Verify
drop name same_word_rep name_rep
rename name_v2 name
Thank you.