
  • On the regular expression of Stata

    I find Stata's regular expressions very confusing. For instance:
    Code:
    disp regexm("010-11223344","\d{3}-\d{8}")
    Stata returns 0 for this evaluation. I then modified the regular expression to:
    Code:
    disp regexm("010-11223344","[0-9]{3}-[0-9]{8}")
    Stata still returned 0. Finally, I rewrote it as:
    Code:
    disp regexm("010-11223344","[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]")
    and Stata returned 1.
    Writing regular expressions this way in Stata feels silly to me. It will be a catastrophe when we encounter a more complex pattern.
    Last edited by Summer Xavier; 05 Mar 2020, 08:16.

  • #2
    Well, there are string operators that are easier to use than regular expressions. For example, instead of using

    Code:
    sysuse auto
    generate grp = regexs(1) if regexm(make, "(Datsun|Pont|Toyota)")
    to create a variable that holds the matched name ("Datsun", "Pont", or "Toyota") whenever make contains one of them, one can use...

    Code:
    gen grp2 = ""
    replace grp2 = "Datsun" if strpos(make, "Datsun") > 0
    replace grp2 = "Pont" if strpos(make, "Pont") > 0
    replace grp2 = "Toyota" if strpos(make, "Toyota") > 0
    Sure, it's longer, but it might be less confusing. And I bet there are ways to make my suggestion above even more concise. My point is that there are alternatives to what looks confusing in regular expressions. Is there a complex example you would like help with?
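
    One way to make the strpos() version more concise is a foreach loop over the three names. This is only a minimal sketch, assuming the auto dataset as above; the loop macro called name is an arbitrary choice.

```stata
sysuse auto, clear

* Start with an empty string variable, then fill it one name at a time
generate grp2 = ""
foreach name in Datsun Pont Toyota {
    * strpos() returns the position of the substring, or 0 if it is absent
    replace grp2 = "`name'" if strpos(make, "`name'") > 0
}
```

    Because strpos() does a plain substring search, the names need no escaping, which would not be true in a regular expression if a name contained characters such as . or (.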



    • #3
      Like you, I was initially frustrated with Stata regular expressions. In Version 14 Stata moved to full Unicode compatibility, and introduced Unicode-capable versions of its string functions.

      If you are an experienced user of regular expressions, you will find Stata's Unicode regular expression string functions much more to your liking. Since ASCII strings are a proper subset of Unicode, the Unicode functions work with ASCII strings. See the output of help ustrregexm() for details on the functions and syntax. But to the best of my knowledge, only in the Statalist post linked here is it documented that Stata's new regular expression parser is the ICU regular expression engine documented at http://userguide.icu-project.org/strings/regexp.
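
      As an illustration of the Unicode functions, the sketch below matches the phone-number pattern from the first post, pulls out its two parts with ustrregexs(), and masks the digits with ustrregexra(). The local macro names (phone, ok, area, number, masked) are made up for this example, and it assumes Stata 14 or later.

```stata
local phone "010-11223344"

* ustrregexm() returns 1 on a match; ( ) mark the subexpressions to capture
local ok = ustrregexm("`phone'", "^(\d{3})-(\d{8})$")

if `ok' {
    * ustrregexs(n) returns subexpression n of the most recent match
    local area   = ustrregexs(1)
    local number = ustrregexs(2)
    display "area code: `area', number: `number'"
}

* ustrregexra() replaces every match of the pattern, here each digit
local masked = ustrregexra("`phone'", "\d", "#")
display "`masked'"
```

      The same functions can be used in generate and replace with a string variable in place of the literal string.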



      • #4
        Originally posted by Igor Paploski:
        Well, there are string operators that are easier to use than regular expressions. For example, instead of using

        Code:
        sysuse auto
        generate grp = regexs(1) if regexm(make, "(Datsun|Pont|Toyota)")
        to create a variable that holds the matched name ("Datsun", "Pont", or "Toyota") whenever make contains one of them, one can use...

        Code:
        gen grp2 = ""
        replace grp2 = "Datsun" if strpos(make, "Datsun") > 0
        replace grp2 = "Pont" if strpos(make, "Pont") > 0
        replace grp2 = "Toyota" if strpos(make, "Toyota") > 0
        Sure, it's longer, but it might be less confusing. And I bet there are ways to make my suggestion above even more concise. My point is that there are alternatives to what looks confusing in regular expressions. Is there a complex example you would like help with?
        Thanks, bro! Your suggestion helped me a lot, and I realized that we can solve problems in Stata the Stata way ^_^



        • #5
          To William's point, consider

          Code:
          . disp ustrregexm("010-11223344","[0-9]{3}-[0-9]{8}")
          1
          
          . disp ustrregexm("010-11223344","\d{3}-\d{8}")
          1
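
          One detail worth keeping in mind with these calls: ustrregexm() reports a match if the pattern occurs anywhere in the string, so a longer string that merely contains a phone number also returns 1. Anchoring the pattern with ^ and $ requires the whole string to fit. A small sketch, with a made-up input string held in a local macro named s:

```stata
local s "tel: 010-11223344 ext 9"

* Unanchored: the pattern is found inside the longer string
display ustrregexm("`s'", "\d{3}-\d{8}")

* Anchored: the whole string must be 3 digits, a hyphen, and 8 digits
display ustrregexm("`s'", "^\d{3}-\d{8}$")

* The bare phone number satisfies the anchored pattern
display ustrregexm("010-11223344", "^\d{3}-\d{8}$")
```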



          • #6
            Originally posted by William Lisowski:
            Like you, I was initially frustrated with Stata regular expressions. In Version 14 Stata moved to full Unicode compatibility, and introduced Unicode-capable versions of its string functions.

            If you are an experienced user of regular expressions, you will find Stata's Unicode regular expression string functions much more to your liking. Since ASCII strings are a proper subset of Unicode, the Unicode functions work with ASCII strings. See the output of help ustrregexm() for details on the functions and syntax. But to the best of my knowledge, only in the Statalist post linked here is it documented that Stata's new regular expression parser is the ICU regular expression engine documented at http://userguide.icu-project.org/strings/regexp.
            Thank you very much, professor! It's very kind of you to give me so much helpful advice and documentation!
            There is a joke:
            Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

            and I told my friend yesterday:
            If you want to solve a problem using regular expressions in Stata, you will have three problems.



            • #7
              Originally posted by Andrew Musau:
              To William's point, consider

              Code:
              . disp ustrregexm("010-11223344","[0-9]{3}-[0-9]{8}")
              1
              
              . disp ustrregexm("010-11223344","\d{3}-\d{8}")
              1
              Yeah, absolutely! Thank you very much professor!!
              Last edited by Summer Xavier; 05 Mar 2020, 09:32.
