On the regular expression of Stata

Summer Xavier

Join Date: Mar 2020

Posts: 17
#1

05 Mar 2020, 08:14

I find Stata's regular expressions very confusing. For instance:

Code:

disp regexm("010-11223344","\d{3}-\d{8}")

Stata returned 0 for this expression, so I changed the pattern to:

Code:

disp regexm("010-11223344","[0-9]{3}-[0-9]{8}")

Stata still returned 0. Finally, I rewrote it as:

Code:

disp regexm("010-11223344","[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]")

and Stata returned 1.
Having to write regular expressions this way in Stata feels silly to me. It will be a catastrophe when we encounter a more complex pattern.

Last edited by Summer Xavier; 05 Mar 2020, 08:16.
Igor Paploski

Join Date: Oct 2014

Posts: 174
#2

05 Mar 2020, 08:38

Well, there are string functions that are easier to use than regular expressions. For example, instead of using

Code:

sysuse auto
generate grp = regexs(1) if regexm(make, "(Datsun|Pont|Toyota)")

to create a variable that holds the matched name when make contains "Datsun" or "Pont" or "Toyota", one can use...

Code:

gen grp2 = ""
replace grp2 = "Datsun" if strpos(make, "Datsun") > 0
replace grp2 = "Pont" if strpos(make, "Pont") > 0
replace grp2 = "Toyota" if strpos(make, "Toyota") > 0

Sure, it's longer, but it might be less confusing. And I bet there are ways to make my suggestion above even more concise. I guess my point is that there are alternatives when regular expressions look confusing. Is there any complex example you would like help with?
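One possible way to make the strpos() version more concise, as a minimal sketch: loop over the three names with foreach. The variable name grp3 and the loop macro name are hypothetical, chosen only for illustration.

```stata
sysuse auto, clear
generate grp3 = ""
foreach name in Datsun Pont Toyota {
    // store the name whenever it appears anywhere in make
    replace grp3 = "`name'" if strpos(make, "`name'") > 0
}
```

This should produce the same values as the four-line version; adding another make only means adding one word to the foreach list.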
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

05 Mar 2020, 08:53

Like you, I was initially frustrated with Stata regular expressions. In Version 14 Stata moved to full Unicode compatibility, and introduced Unicode-capable versions of its string functions.

If you are an experienced user of regular expressions, you will find Stata's Unicode regular expression string functions much more to your liking. Since ASCII strings are a proper subset of Unicode, the Unicode functions work with ASCII strings. See the output of help ustrregexm() for details on the functions and syntax. But to the best of my knowledge, only in the Statalist post linked here is it documented that Stata's new regular expression parser is the ICU regular expression engine documented at http://userguide.icu-project.org/strings/regexp.
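As a small sketch of what the Unicode functions look like in practice: ustrregexm() tests for a match and ustrregexra() replaces every match. The phone-number string is borrowed from post #1; the anchored pattern and the digit-stripping example are illustrative assumptions, not taken from the documentation.

```stata
// ICU syntax accepts \d, \D and {n} counted repetition
display ustrregexm("010-11223344", "^\d{3}-\d{8}$")

// remove every character that is not a digit
display ustrregexra("010-11223344", "\D", "")
```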
Summer Xavier

Join Date: Mar 2020

Posts: 17
#4

05 Mar 2020, 09:13

Originally posted by Igor Paploski (#2)

Thanks, bro! Your suggestion helps me a lot, and I now realize that problems in Stata are best solved the Stata way ^_^
Andrew Musau

Join Date: Oct 2014

Posts: 10195
#5

05 Mar 2020, 09:22

To William's point, consider

Code:

. disp ustrregexm("010-11223344","[0-9]{3}-[0-9]{8}")
1

. disp ustrregexm("010-11223344","\d{3}-\d{8}")
1
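The same functions can also pull out pieces of the match. A minimal sketch, assuming the number should be split into its two digit groups (the capture groups and anchors are an illustrative addition): ustrregexs(n) returns the n-th parenthesized group from the most recent successful ustrregexm().

```stata
// parentheses define capture groups; ustrregexs(n) reads group n of the last match
if ustrregexm("010-11223344", "^(\d{3})-(\d{8})$") {
    display ustrregexs(1)   // first group: the three digits before the hyphen
    display ustrregexs(2)   // second group: the eight digits after it
}
```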
Summer Xavier

Join Date: Mar 2020

Posts: 17
#6

05 Mar 2020, 09:25

Originally posted by William Lisowski (#3)

Thank you very much, professor! It's very kind of you to give me so much helpful advice and documentation!
There is a joke:
Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.
And I told my friend yesterday:
If you want to solve a problem using regular expressions in Stata, you will have three problems.
Summer Xavier

Join Date: Mar 2020

Posts: 17
#7

05 Mar 2020, 09:27

Originally posted by Andrew Musau (#5)

Yeah, absolutely! Thank you very much professor!!

Last edited by Summer Xavier; 05 Mar 2020, 09:32.