
  • Regular expression operators

    Dear Statalisters,

    Do you know if Stata or Mata supports POSIX character classes or Perl-style meta-characters in its regular expression functions? A Stata FAQ acknowledges these operators, but it is unclear whether Stata supports them (http://www.stata.com/support/faqs/da...r-expressions/).

    Also, in help string functions, the Unicode regular expression example uses the syntax {n}, which in standard regex is expected to match exactly n repetitions of the preceding element, but instead it returns the nth position of the string. This is confusing. If Stata does not support Perl-style meta-characters, can you please advise whether there is a way to specify word boundaries, backreferences and assertions (lookahead and lookbehind) using the current syntax? I am aware of moss.ado, which addresses some of these operations using strpos(), but my question is which regular expression operators are supported. Are the core operators described in the above FAQ the only ones supported? If so, can the good people of Stata consider adding more functionality to regex?

    thanks, Demetris Christodoulou

  • #2
    Demetris,

    My understanding is that what is in that FAQ is all that there is, i.e., POSIX and Perl extensions to this basic functionality are not supported.

    However, you are correct that ustrregexm() (the Unicode version of regexm()) includes additional functionality, but I'm not sure where this is documented. Somehow (I don't remember how at this point) I figured out that the following could be used, for example, to do a non-greedy (lazy) match:

    Code:
    * Group 1 lazily consumes the rest of the tag; group 2 captures the name between > and <
    replace RealName=ustrregexs(2) if ustrregexm(stuff,"ProfileHeaderCard-nameLink(.+?)>([A-Za-z ]*)<")
    As you probably know from using other regex standards, the lazy quantifier .+? slurps up everything only until the next >, rather than greedily slurping up as much of the string as it can, which is the behavior of regexm().
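
    To make the difference concrete, here is a minimal sketch that contrasts the lazy and greedy quantifiers within ustrregexm() itself. It assumes Stata 14 or later; the local macro s and its contents are made up for illustration and are not the actual page being parsed.

    ```stata
    * Hypothetical toy string standing in for a fragment of HTML
    local s "<a href=/x>Jane Doe</a><b>tail</b>"

    * Lazy: .+? stops at the first > that lets the rest of the pattern match
    display ustrregexm("`s'", "<a(.+?)>")
    display ustrregexs(1)
    * group 1 is " href=/x"

    * Greedy: .+ runs on to the last > in the string
    display ustrregexm("`s'", "<a(.+)>")
    display ustrregexs(1)
    * group 1 is " href=/x>Jane Doe</a><b>tail</b"
    ```

    Note that ustrregexs() always refers to the most recent successful ustrregexm() call, so each display ustrregexs(1) has to follow its own match.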

    At the 2015 Stata Conference, one or more of us requested that StataCorp expand the regex offerings to be more in line with other implementations. As I recall, the response was mixed: neither definitively negative nor positive. At any rate, perhaps a query to Tech Support might be in order, both to get an answer to your specific question about {n} and to put in a plug for more documentation.

    Regards,
    Joe



    • #3
      There are some details on the non-Unicode regular expression parser. The Unicode parser is based on the ICU (International Components for Unicode) standard, but I cannot recall where I found that documented.
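
      If the Unicode functions do follow ICU syntax, the operators asked about in #1 ({n}, word boundaries, backreferences, lookahead and lookbehind) should be usable through ustrregexm() and ustrregexs(). A minimal sketch under that assumption, for Stata 14 or later; all test strings are made up for illustration:

      ```stata
      * Assumes ustrregexm()/ustrregexs() accept ICU regular expression syntax

      * {n}: exactly n repetitions of the preceding element
      display ustrregexm("abc12345", "\d{3}")
      display ustrregexs(0)
      * matches, and the whole match is 123

      * \b: word boundary
      display ustrregexm("a cat sat", "\bcat\b")
      display ustrregexm("concatenate", "\bcat\b")
      * 1 for the whole word, 0 when cat is buried inside a longer word

      * \1: backreference to the first capture group (finds a doubled word)
      display ustrregexm("it was was fine", "\b(\w+) \1\b")
      display ustrregexs(1)
      * group 1 is was

      * (?=...): lookahead, digits only when followed by " USD"
      display ustrregexm("total 100 USD", "\d+(?= USD)")
      display ustrregexs(0)
      * the match is 100, without the currency

      * (?<=...): lookbehind, digits only when preceded by "id="
      display ustrregexm("id=42", "(?<=id=)\d+")
      display ustrregexs(0)
      * the match is 42
      ```

      None of this applies to the non-Unicode regexm(), which is limited to the core operators in the FAQ.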

      There are also user-written tools that give you access to other parsers.



      • #4
        Thanks Joe and Dimitriy for your responses. I am very glad to learn that ustrregexm() supports lookahead assertions. I am also parsing HTML source code and this solves precisely one of my problems. The Statalist thread link is also very helpful, and it appears to be an official position from StataCorp. I am organising the Oceania SUGM in 4 weeks and I will make sure that the request for more functionality in regex finds its way into the wish list too. With the rise of text analytics, regular expressions are bound to become more important in Stata too.
