Wildcards with strpos/regexr

RonD McDowell

Join Date: Apr 2015

Posts: 44
#1

15 Oct 2015, 03:36

Could someone give me an example of using a wildcard with strpos and regexr? For example, I want to scan a string variable (containing multiple words) called meds for nu seal, nu-seal and nuseal (or variants thereof) and replace each with aspirin. For example,

replace meds=regexr(meds,"nu[ -]seal", "aspirin") works for nu-seal and nu seal, but doesn't catch "nuseal". I know I could write another line, but there are cases where I'd like to build a wildcard into a single regex command. The same holds for strpos. Any pointers would be very much appreciated!
Nick Cox

Join Date: Mar 2014

Posts: 35698
#2

15 Oct 2015, 03:53

The best way to learn functions like this is not to generate or replace variables but to use display with examples where you can work out the answer you want and check whether you get it.

strpos() works only with literal matches. But preceding and following text are entirely possible; that's part of the point.

Code:

. di strpos("frog science", "frog")
1

. di strpos("toad frog newt", "frog")
6

. di strpos("unicorn", "frog")
0

If your interest is only whether a string is found as a substring in text, then note that

Code:

... if strpos("frog science", "frog") > 0
... if strpos("frog science", "frog")

are equivalent as logical tests, as a non-zero argument counts as true. That's the good news, but the bad news for your problem is that all possible variants need to be tested separately; I can't see a way to use strpos() otherwise.
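
Testing each variant separately with strpos() might look like the following minimal sketch. The variable name meds comes from the question; the data values and the indicator name has_nuseal are made up for illustration.

```stata
* Toy data: the values are hypothetical, only the name meds is from the thread
clear
input str30 meds
"nuseal 75mg"
"nu seal daily"
"nu-seal, statin"
"paracetamol"
end

* strpos() returns 0 when the substring is absent, so each call
* doubles as a true/false test and the calls can be joined with |
generate byte has_nuseal = strpos(meds, "nuseal") | strpos(meds, "nu seal") | strpos(meds, "nu-seal")
list meds has_nuseal
```

The first three observations should be flagged with 1 and the last with 0.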

I will pass on the regular expression syntax given a meeting in a few minutes....

EDIT: That was a short meeting!

Code:

. di regexr("nuseal","nu(.*)seal", "aspirin")
aspirin

. di regexr("nu-seal","nu(.*)seal", "aspirin")
aspirin

. di regexr("nu seal","nu(.*)seal", "aspirin")
aspirin
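
A narrower alternative, offered here as a sketch rather than as part of the original answer: the pattern nu(.*)seal accepts any text at all between "nu" and "seal", so it can match more than intended. Making the separator from the original attempt optional with ? covers the three variants and nothing wider. The test string "menu for seal" is a made-up example.

```stata
* [ -]? matches one space, one hyphen, or nothing at all
di regexr("nuseal",  "nu[ -]?seal", "aspirin")
di regexr("nu-seal", "nu[ -]?seal", "aspirin")
di regexr("nu seal", "nu[ -]?seal", "aspirin")

* no match, so the string comes back unchanged
di regexr("menu for seal", "nu[ -]?seal", "aspirin")

* the .* version matches "nu for seal" and returns "measpirin"
di regexr("menu for seal", "nu(.*)seal", "aspirin")
```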

Last edited by Nick Cox; 15 Oct 2015, 04:44.
RonD McDowell

Join Date: Apr 2015

Posts: 44
#3

15 Oct 2015, 05:30

Thanks a lot, Nick, that is just what I'm looking for! One final addendum: suppose the text I want to overwrite includes parentheses, and I want to use regex to replace e.g. "(nuseal)" with "aspirin". Is this possible? When I try replace meds=regexr(meds,"(nuseal)", "aspirin") I get (aspirin) and not aspirin.
Nick Cox

Join Date: Mar 2014

Posts: 35698
#4

15 Oct 2015, 05:47

The problem is that parentheses have syntactic meaning in regular expressions; that's why they appeared as such in the first solution. There is a syntax of escape characters to insist that you want literal matches, but it is easy enough to avoid all that and strip out the parentheses with subinstr() first.

Writing a regular expression to match absolutely all possibilities is appealing to some tastes, but I would divide the task into smaller tasks.
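
Both routes can be sketched with display. This assumes a backslash before a parenthesis makes it literal, which matches the escape syntax mentioned above; the example string and the local macro name cleaned are made up.

```stata
* \( and \) match literal parentheses instead of opening a group
di regexr("take (nuseal) daily", "\(nuseal\)", "aspirin")

* alternative: strip the parentheses with subinstr() first, then match plainly
local cleaned = subinstr(subinstr("take (nuseal) daily", "(", "", .), ")", "", .)
di regexr("`cleaned'", "nuseal", "aspirin")
```

Both calls should display "take aspirin daily".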

Brendan Cox

Join Date: May 2015

Posts: 36
#5

15 Oct 2015, 08:32

On the topic of matching variations of a string: Note that the function strmatch() allows the use of wildcards (* or ?), and is sometimes a bit faster than using a regular expression. Of course it doesn't have the flexibility provided by regexm(), or the ability to then substitute/replace using regexs() or regexr(), but it is a bit more flexible than strpos().

Code:

// note that strmatch assumes that s2 is the beginning & end
// of the entire string, unless you explicitly supply wildcards
// to tell it otherwise

// e.g., "nu*seal" will properly match ex. 1, 2, and 4, but not 3 or 5
foreach s in "nuseal" "nu seal" "(nu)seal" "nu-seal" "..nu seal.." {
    di "found in `s'?" strmatch("`s'", "nu*seal")
}

// adding "*" to the beginning and end of s2 fixes this:
foreach s in "nuseal" "nu seal" "(nu)seal" "nu-seal" "..nu seal.." {
    di "found in `s'?  " strmatch("`s'", "*nu*seal*")
}


RonD McDowell

Join Date: Apr 2015

Posts: 44
#6

16 Oct 2015, 02:18

Thanks, both of you, for these very helpful comments. I have found escape characters helpful in removing extraneous parentheses, e.g. replace meds=regexr(meds,"\((ec)+\)","ec")
RonD McDowell

Join Date: Apr 2015

Posts: 44
#7

20 Oct 2015, 04:33

Is it possible to put exclusions on wildcards? For example, suppose I am searching a string variable for variants of nu-seal. I don't want matches where nu and seal are separated by an alphanumeric character, but I would like all other variants returned.
Nick Cox

Join Date: Mar 2014

Posts: 35698
#8

20 Oct 2015, 04:54

Indeed. You can just complicate the expression to be matched or use a compound condition.

Here's a stupid example. You want to catch "Stata" but not if "user" is mentioned. So "Stata user" qualifies on the first rule, but is disqualified on the second.

Code:

. di strpos("Stata user", "Stata") & !strpos("Stata user", "user")
0

The example here uses one function, but the principle carries over to similar functions.
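
For the nu-seal case in #7, the "complicate the expression" route might be sketched with a negated character class. The pattern below is an illustration, not something posted in the thread, and the test strings are made up.

```stata
* [^a-zA-Z0-9]* allows any run of non-alphanumeric characters (or none)
* between "nu" and "seal", so a letter or digit in between blocks the match
foreach s in "nuseal" "nu seal" "nu-seal" "nu/seal" "nu2seal" "nuxseal" {
    di "`s': " regexm("`s'", "nu[^a-zA-Z0-9]*seal")
}
```

The first four strings should show 1 and the last two 0.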
RonD McDowell

Join Date: Apr 2015

Posts: 44
#9

20 Oct 2015, 05:24

Thanks for this posting. How would I adjust the code if I wanted to catch "Stata" but not if any of the numbers 0-9 were mentioned?
Nick Cox

Join Date: Mar 2014

Posts: 35698
#10

20 Oct 2015, 05:46

Searching for numeric characters is very well documented. Do see

FAQ: Regular expressions (K. S. Turner, 10/05)
What are regular expressions and how can I use them in Stata?
http://www.stata.com/support/faqs/data/regex.html

Code:

. di strmatch("Stata 14", "*Stata*") & !regexm("Stata 14", "([0-9]+)")
0
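
One detail worth checking with display, as #5 points out: strmatch() compares against the whole string unless wildcards are supplied, so the pattern "Stata" on its own does not match "Stata 14". The sketch below shows that, then runs the compound condition on two made-up strings using strpos(), which has no such restriction.

```stata
* displays 0: without wildcards the whole string must equal "Stata"
di strmatch("Stata 14", "Stata")

* displays 1: wildcards allow text before and after
di strmatch("Stata 14", "*Stata*")

* catch "Stata" unless any digit 0-9 appears: displays 0, then 1
foreach s in "Stata 14" "Stata user" {
    di "`s': " strpos("`s'", "Stata") & !regexm("`s'", "[0-9]")
}
```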
RonD McDowell

Join Date: Apr 2015

Posts: 44
#11

20 Oct 2015, 07:35

Thanks very much for these postings. I really appreciate the advice and links.
RonD McDowell

Join Date: Apr 2015

Posts: 44
#12

29 Oct 2015, 12:27

Back to my thread as I'm looking for advice again! I want to use regexr to replace the phrase "new/word" with "newword". However, I can't recall how to escape the / in the expression:
replace variable=regexr(variable,"new/word", "newword"). Can anyone advise?
Nick Cox

Join Date: Mar 2014

Posts: 35698
#13

29 Oct 2015, 12:37

See #2 again for the advice to play with examples and display

Code:

. di regexr("stuff new/word stuff","new/word", "newword")
stuff newword stuff

The forward slash has no special meaning in regular expressions and can be searched for as a literal character.

See #10 again for a link to documentation. Whatever is not a special character ... is not special.
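
The contrast between an ordinary character and a special one can be checked the same way. This is a small sketch with made-up strings; it assumes the period is the "any single character" metacharacter and that a backslash makes it literal.

```stata
* "/" is an ordinary character, so it matches only itself: 1, then 0
di regexm("new/word", "new/word")
di regexm("newXword", "new/word")

* "." matches any single character unless escaped: 1, then 0, then 1
di regexm("newXword", "new.word")
di regexm("newXword", "new\.word")
di regexm("new.word", "new\.word")
```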
