Regular expression operators

Demetris Christodoulou

Join Date: Apr 2015

Posts: 11
#1

30 Aug 2016, 14:28

Dear Statalisters,

Do you know whether Stata or Mata supports POSIX character classes or Perl-style meta-characters in its regular expression functions? A Stata FAQ acknowledges these operators, but it is unclear whether Stata supports them (http://www.stata.com/support/faqs/da...r-expressions/).

Also, in help string functions, the Unicode regular expression example uses the syntax {n}, which in standard regex is expected to match exactly n repetitions of the preceding element, but instead it seems to return the nth position of the string. This is confusing. If Stata does not support Perl-style meta-characters, can you please advise whether there is a way to specify word boundaries, backreferences and assertions (lookahead and lookbehind) using the current syntax? I am aware of moss.ado, which addresses some of these operations using strpos(), but my question is which regular expression operators are supported. Are the core operators described in the above FAQ the only ones supported? If so, could the good people of Stata consider adding more functionality to regex?

thanks, Demetris Christodoulou
Joe Canner

Join Date: Mar 2014

Posts: 580
#2

30 Aug 2016, 15:12

Demetris,

My understanding is that what is in that FAQ is all that there is, i.e., POSIX and Perl extensions to this basic functionality are not supported.

However, you are correct that ustrregexm() (the Unicode version of regexm()) includes additional functionality, but I'm not sure where this is documented. Somehow (I don't remember how at this point) I figured out that the following could be used, for example, to do a non-greedy match (in effect, a conditional lookahead):

Code:

replace RealName=ustrregexs(2) if ustrregexm(stuff,"ProfileHeaderCard-nameLink(.+?)>([A-Za-z ]*)<")

As you probably know from using other regex standards, the .+? allows me to slurp up everything only until the next >, rather than greedily slurping up as much of the string as possible, which is the behavior of regexm().
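The difference between the greedy and the lazy quantifier is easiest to see on a toy string. The following is only a minimal sketch, assuming Stata 14 or later (where the ustrregex*() functions are available); the HTML fragment and the variable names stuff, greedy and lazy are made up for illustration.

Code:

```stata
* Hypothetical one-row dataset holding a made-up HTML fragment
clear
input str60 stuff
"<a href=x>Jane Doe</a><b>tail</b>"
end

* Greedy: .+ takes as much as it can, so group 1 runs up to the last ">"
generate greedy = ustrregexs(1) if ustrregexm(stuff, "<a(.+)>")

* Lazy: .+? takes as little as it can, so group 1 stops before the first ">"
generate lazy = ustrregexs(1) if ustrregexm(stuff, "<a(.+?)>")

list greedy lazy
```

The same generate ... if ustrregexm() idiom is used as in the replace statement above, so that ustrregexs() always refers to the match just made on that observation.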

At the 2015 Stata Conference, one or more of us requested that StataCorp expand the regex offerings to be more in line with other implementations. As I recall, the response was mixed: neither definitively negative nor positive. At any rate, a query to Tech Support might be in order, both to get an answer to your specific question about {n} and to put in a plug for more documentation.

Regards,
Joe
Dimitriy V. Masterov

Join Date: Mar 2014

Posts: 609
#3

30 Aug 2016, 15:27

There are some details on the non-Unicode regular expression parser. The Unicode parser is based on the ICU standard, but I cannot recall where I saw that documented.
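If the Unicode functions do follow ICU, then the operators asked about in #1 should be available through ustrregexm(). Here is a rough sketch of how one might check that, assuming Stata 14 or later and ICU-style syntax; the two test strings and all variable names are invented for the example, and I have not confirmed this against official documentation.

Code:

```stata
* Two made-up test strings (hypothetical data)
clear
input str40 s
"the the cat sat on 12345 mats"
"price: 100USD and 200EUR"
end

* {n}: exactly n repetitions of the preceding token (here, five digits)
generate five_digits = ustrregexs(0) if ustrregexm(s, "\d{5}")

* \b: word boundary, so "cat" is matched only as a whole word
generate has_cat = ustrregexm(s, "\bcat\b")

* \1: backreference to group 1, used here to spot a doubled word
generate doubled = ustrregexs(1) if ustrregexm(s, "\b(\w+) \1\b")

* (?=...): lookahead, digits only when followed by "EUR"
generate eur = ustrregexs(0) if ustrregexm(s, "\d+(?=EUR)")

* (?<=...): lookbehind, digits only when preceded by "price: "
generate price = ustrregexs(0) if ustrregexm(s, "(?<=price: )\d+")

list
```

None of the patterns contains a dollar sign or a left single quote, so Stata's macro expansion should leave the backslashes alone.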

There are also user-written tools that give you access to other parsers.
Demetris Christodoulou

Join Date: Apr 2015

Posts: 11
#4

30 Aug 2016, 16:49

Thanks, Joe and Dimitriy, for your responses. I am very glad to learn that ustrregexm() supports lookahead assertions. I am also parsing HTML source code, and this solves precisely one of my problems. The Statalist thread link is also very helpful, and it appears to give an official position from StataCorp. I am organising the Oceania SUGM in 4 weeks, and I will make sure that the request for more regex functionality finds its way onto the wish list too. With the rise of text analytics, regular expressions are bound to become more important in Stata.