regexp: nested *?+ error on Mac

James Chikelu

Join Date: Sep 2024

Posts: 4
#1

22 Sep 2024, 12:25

My dataset contains hundreds of values in which the name is written twice, as shown below:

name
Alan B. MenkesAlan B. Menkes
Albert Hill IV Albert Hill IV
Alexander MichaelAlexander Michael
Alexis Borisy Alexis Borisy
Ali Bouzarif Ali Bouzarif
Ali Satvat Ali Satvat
Allen I. QuestromAllen I. Questrom
Allen M. Gibson Allen M. Gibson
Amara Suebsaeng Amara Suebsaeng
Ambyr O DonnellAmbyr O Donnell
Amir Moftakhar Amir Moftakhar
Amortisation Amortisation
Anand GalaAnand Gala
Andr Pienaar Andr Pienaar
Andrea L. Saia Andrea L. Saia
Andrew C. PearsonAndrew C. Pearson

Some names have a space before the repetition; others do not.

I used the code block below to remove the repeated names. It runs successfully on my Windows PC with Stata 18, but it fails on a Mac running Stata 16 and on a Windows PC running Stata 14.

What could be the issue? Why does the regex return the error "regexp: nested *?+" when I run it on those two machines with the older Stata versions?

Kindly help if you have any ideas.

gen name_v2 = name
gen same_word_rep = 0 // To flag repeated words that may or may not be names like "Tian Tian" "Chow Chow"
replace same_word_rep = 1 if regexm(name, "^(\b\w+\b)\s*\1$")
gen name_rep = 0
replace name_rep = 1 if regexm(name, "^(.*?)(?:\s?\1)+$")
replace name_v2 = regexs(1) if regexm(name, "^(.*?)(?:\s?\1)+$") & same_word_rep == 0
br name name_v2 if name_rep == 1 // Verify
drop name same_word_rep name_rep
rename name_v2 name

Thank you.
Bjarte Aagnes

Join Date: Apr 2014

Posts: 783
#2

22 Sep 2024, 13:32

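In Stata 18, the byte-stream-based regular expression functions (regexm(), regexr(), and regexs()) were updated to use the Boost library as their engine, which would explain why the code runs in Stata 18 but not in 14 or 16. The Unicode regular expression functions ustrregexm(), ustrregexrf(), ustrregexra(), and ustrregexs(), which use the ICU engine, were added in version 14, so using those should be fine in versions 14, 16, and 18.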
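
As a minimal sketch of that suggestion, the block from post #1 can be rewritten with the Unicode functions, keeping the same patterns and variable names. It assumes the ICU engine accepts the lazy quantifier `*?` and the backreference `\1` as written, which the older byte-stream engine apparently does not. Treat it as a starting point and check the result in the browser before dropping anything.

```stata
* Sketch: same logic as post #1, using the ICU-based functions (Stata 14 and later)
gen name_v2 = name

* Flag a single word that is repeated once, e.g. "Tian Tian" or "Chow Chow"
gen byte same_word_rep = ustrregexm(name, "^(\b\w+\b)\s*\1$")

* Flag any value made of one block repeated, with or without a space in between
gen byte name_rep = ustrregexm(name, "^(.*?)(?:\s?\1)+$")

* Keep only the first copy; ustrregexs(1) returns capture group 1 of the last match
replace name_v2 = ustrregexs(1) if ustrregexm(name, "^(.*?)(?:\s?\1)+$") & same_word_rep == 0

browse name name_v2 if name_rep == 1   // verify before replacing the original
```

Only the function names change; the patterns themselves are the ones from the original post.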
James Chikelu

Join Date: Sep 2024

Posts: 4
#3

31 Mar 2025, 15:28

Thank you. This solution worked.
Clyde Schechter

Join Date: Apr 2014

Posts: 29948
#4

31 Mar 2025, 20:39

You can do this without all the rigmarole of regular expressions. You just need to check whether the first half of the value of name is also found in the second half.

Code:

assert name == trim(name) // VERIFY NO LEADING OR TRAILING SPACES
gen name_length = floor(strlen(name)/2)
gen wanted = substr(name, 1, name_length)
replace wanted = cond(strpos(substr(name, name_length+1, .), wanted), wanted, name)

Note: This code will set wanted equal to the original value of name when the value of name is not duplicated.
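
To see the half-split check on a few rows, here is a small self-contained sketch. The first three values are taken from post #1; "Jane Doe" is a hypothetical non-duplicated name added only for illustration.

```stata
* Toy data: three doubled names from post #1 and one hypothetical single name
clear
input str40 name
"Alan B. MenkesAlan B. Menkes"
"Albert Hill IV Albert Hill IV"
"Anand GalaAnand Gala"
"Jane Doe"
end

* First half of the value, rounding down when the length is odd
gen name_length = floor(strlen(name)/2)
gen wanted = substr(name, 1, name_length)

* Keep the first half only if it also occurs in the remainder; otherwise keep name
replace wanted = cond(strpos(substr(name, name_length+1, .), wanted), wanted, name)

list name wanted, noobs
```

Rounding the length down is what lets the same code handle values with and without a separating space: when a space is present it ends up at the start of the second half, where strpos() still finds the first half.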