  • regexp: nested *?+ error on Mac

    I have hundreds of names that are repeated in my dataset, as shown below:

    name
    Alan B. MenkesAlan B. Menkes
    Albert Hill IV Albert Hill IV
    Alexander MichaelAlexander Michael
    Alexis Borisy Alexis Borisy
    Ali Bouzarif Ali Bouzarif
    Ali Satvat Ali Satvat
    Allen I. QuestromAllen I. Questrom
    Allen M. Gibson Allen M. Gibson
    Amara Suebsaeng Amara Suebsaeng
    Ambyr O DonnellAmbyr O Donnell
    Amir Moftakhar Amir Moftakhar
    Amortisation Amortisation
    Anand GalaAnand Gala
    Andr Pienaar Andr Pienaar
    Andrea L. Saia Andrea L. Saia
    Andrew C. PearsonAndrew C. Pearson

    Some names have a space before the repetition; others do not.

    I used the code block below to remove the repeated names, but it does not run on a Mac with Stata 16 or on a Windows PC with Stata 14. It runs successfully on my Windows PC with Stata 18.

    What could be the issue? Why does the regex return the error "regexp: nested *?+" when I run it on those machines with those Stata versions?

    Kindly help if you have any ideas.

    gen name_v2 = name
    gen same_word_rep = 0 // To flag repeated words that may or may not be names like "Tian Tian" "Chow Chow"
    replace same_word_rep = 1 if regexm(name, "^(\b\w+\b)\s*\1$")
    gen name_rep = 0
    replace name_rep = 1 if regexm(name, "^(.*?)(?:\s?\1)+$")
    replace name_v2 = regexs(1) if regexm(name, "^(.*?)(?:\s?\1)+$") & same_word_rep == 0
    br name name_v2 if name_rep == 1 // Verify
    drop name same_word_rep name_rep
    rename name_v2 name

    Thank you.

  • #2
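    In Stata 18, the byte-stream-based functions such as regexm() and regexs() were updated to use the Boost library as the engine, so patterns that rely on newer syntax, such as the lazy quantifier in (.*?), may fail with those functions in earlier versions. The Unicode regular expression functions ustrregexm(), ustrregexrf(), ustrregexra(), and ustrregexs(), which use the ICU engine, were added in version 14, so using those should be fine in both version 18 and version 14.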
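
    As a minimal sketch of that suggestion, the pattern from #1 can be passed unchanged to the Unicode functions. The three sample rows are copied from the question, and the variable name name_v2 follows the original code; this assumes Stata 14 or later and is an illustration, not a tested drop-in replacement for the full do-file.

```stata
clear
input str40 name
"Alan B. MenkesAlan B. Menkes"
"Albert Hill IV Albert Hill IV"
"Anand GalaAnand Gala"
end

* Start from the original value so unmatched rows are left unchanged
gen name_v2 = name

* ustrregexm() uses the ICU engine, which accepts the lazy quantifier
* and the backreference \1; ustrregexs(1) returns the first capture group
replace name_v2 = ustrregexs(1) if ustrregexm(name, "^(.*?)(?:\s?\1)+$")

* Inspect the result
list name name_v2
```

    The same substitution (regexm to ustrregexm, regexs to ustrregexs) should apply to the other regexm() calls in the original code.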



    • #3
      Thank you. This solution worked.



      • #4
        You can do this without all the rigmarole of regular expressions. You just need to check whether the first half of the value of name is also found in the second half.

        Code:
        assert name == trim(name) // VERIFY NO LEADING OR TRAILING SPACES
        
        gen name_length = floor(strlen(name)/2)
        gen wanted = substr(name, 1, name_length)
        replace wanted = cond(strpos(substr(name, name_length+1, .), wanted), wanted, name)
        Note: This code will set wanted equal to the original value of name when the value of name is not duplicated.
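
        To see how the halving approach behaves, here is a small self-contained check on two duplicated names from the question plus one non-duplicated name. The third row and the variable name half are made up for illustration; the logic is the same as above, written with an explicit replace ... if instead of cond().

```stata
clear
input str40 name
"Alan B. MenkesAlan B. Menkes"
"Albert Hill IV Albert Hill IV"
"Alexis Borisy"
end

* Length of the first half, rounded down so an optional middle space
* falls into the second half
gen half = floor(strlen(name)/2)
gen wanted = substr(name, 1, half)

* Keep the original value when the first half does not recur in the rest
replace wanted = name if strpos(substr(name, half + 1, .), wanted) == 0

* Checks: duplicated names collapse, the single name is untouched
assert wanted == "Alan B. Menkes" in 1
assert wanted == "Albert Hill IV" in 2
assert wanted == "Alexis Borisy" in 3
```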
