  • Wildcards with strpos/regexr

    Could someone give me an example of using a wildcard with strpos and regexr? For example, I want to scan a string variable (with multiple words) called meds for nu seal, nu-seal and nuseal (or variants thereof) and replace them with aspirin, e.g.

    replace meds=regexr(meds,"nu[ -]seal", "aspirin") works for nu-seal and nu seal, but doesn't include "nuseal". I know I could write another line, but there are instances where I'd like to incorporate a wildcard into the one regex command. Same holds for strpos. Any pointers would be very much appreciated!

  • #2
    The best way to learn functions like this is not to generate or replace variables but to use display with examples where you can work out the answer you want and check whether you get it.

    strpos() works only with literal matches. But preceding and following text are entirely possible; that's part of the point.

    Code:
    . di strpos("frog science", "frog")
    1
    
    . di strpos("toad frog newt", "frog")
    6
    
    . di strpos("unicorn", "frog")
    0
    If your interest is only whether a string is found as a substring in text, then note that

    Code:
    ... if strpos("frog science", "frog")  > 0
    
    ... if strpos("frog science", "frog")
    are equivalent as logical tests, as a non-zero argument counts as true. That's the good news, but the bad news for your problem is that all possible variants need to be tested separately; I can't see a way to use strpos() otherwise.
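A minimal sketch of testing the variants separately in one expression, joining the strpos() calls with the | (or) operator. The sample string is made up for illustration, and the code is meant to be run from a do-file.

```stata
// hypothetical sample string; each strpos() call tests one literal variant
local s "patient takes nu-seal daily"
display strpos("`s'", "nuseal") | strpos("`s'", "nu seal") | strpos("`s'", "nu-seal")
// displays 1, because the third variant is found
```

The same expression should work in an if qualifier, with the variable name in place of the macro.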

    I will pass on the regular expression syntax given a meeting in a few minutes....

    EDIT: That was a short meeting!

    Code:
    . di regexr("nuseal","nu(.*)seal", "aspirin")
    aspirin
    
    . di regexr("nu-seal","nu(.*)seal", "aspirin")
    aspirin
    
    . di regexr("nu seal","nu(.*)seal", "aspirin")
    aspirin
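One caveat on nu(.*)seal: .* matches any run of characters, so unrelated text between nu and seal is swallowed too. A tighter sketch, assuming only a single space or hyphen may separate the two parts, makes the separator optional with ?. The "nuclear seal" string is made up to show the difference.

```stata
// [ -]? matches zero or one space or hyphen
display regexr("nuseal",  "nu[ -]?seal", "aspirin")
display regexr("nu-seal", "nu[ -]?seal", "aspirin")
display regexr("nu seal", "nu[ -]?seal", "aspirin")
// all three display aspirin

// the broad pattern rewrites this string; the tight one leaves it unchanged
display regexr("nuclear seal", "nu(.*)seal", "aspirin")
display regexr("nuclear seal", "nu[ -]?seal", "aspirin")
```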


    Last edited by Nick Cox; 15 Oct 2015, 04:44.



    • #3
      Thanks a lot, Nick; that is just what I'm looking for! One final addendum: suppose the text I want to overwrite has brackets as part of the text, and I want to use regex to replace e.g. "(nuseal)" with "aspirin". Is this possible? When I try replace meds=regexr(meds,"(nuseal)", "aspirin") I get (aspirin) and not aspirin.



      • #4
        The problem is that parentheses have syntactic meaning in regular expressions; that's why they appeared as such in the first solution. There is a syntax of escape characters to insist that you want literal matches, but it is easy enough to avoid all that and strip out the parentheses () with subinstr().

        Writing a regular expression to match absolutely all possibilities is appealing to some tastes, but I would divide the task into smaller tasks.
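Both routes can be sketched with display, using the "(nuseal)" example from #3. This is only an illustration, run from a do-file; the backslash is the escape character in Stata's regular expressions.

```stata
// option 1: escape the parentheses so they are matched literally
display regexr("(nuseal)", "\(nuseal\)", "aspirin")

// option 2: strip the parentheses first with subinstr(), then match as before
local s = subinstr(subinstr("(nuseal)", "(", "", .), ")", "", .)
display regexr("`s'", "nuseal", "aspirin")
// both display aspirin
```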



        • #5
          On the topic of matching variations of a string: Note that the function strmatch() allows the use of wildcards (* or ?), and is sometimes a bit faster than using a regular expression. Of course it doesn't have the flexibility provided by regexm(), or the ability to then substitute/replace using regexs() or regexr(), but it is a bit more flexible than strpos().

          Code:
          // note that strmatch assumes that s2 is the beginning & end
          // of the entire string, unless you explicitly supply wildcards
          // to tell it otherwise
          
          // e.g., "nu*seal" will properly match ex. 1, 2, and 4, but not 3 or 5
          foreach s in "nuseal" "nu seal" "(nu)seal" "nu-seal" "..nu seal.." {
              di "found in `s'?  " strmatch("`s'", "nu*seal")
          }
          
          // adding "*" to the beginning and end of s2 fixes this:
          foreach s in "nuseal" "nu seal" "(nu)seal" "nu-seal" "..nu seal.." {
              di "found in `s'?  " strmatch("`s'", "*nu*seal*")
          }
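The other wildcard, ?, stands for exactly one character, which gives a stricter test than *. A small sketch for a do-file; the variable name is_nuseal in the commented line is hypothetical.

```stata
// ? requires exactly one character between nu and seal
display strmatch("nu-seal", "nu?seal")   // 1
display strmatch("nuseal",  "nu?seal")   // 0: nothing sits between nu and seal

// strmatch() only tests, so it suits flagging observations rather than editing text:
// generate byte is_nuseal = strmatch(meds, "*nu*seal*")
```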



          • #6
            Thanks both of you for these very helpful comments. I have found escaping with a backslash helpful in removing extraneous parentheses, e.g. replace meds=regexr(meds,"\((ec)+\)","ec")



            • #7
              Is it possible to put exclusions on wildcards? For example, supposing I am searching a string variable for variants of nu-seal. I don't want it to return variants where nu and seal are broken by an alphanumeric character, but would like it to return all others?



              • #8
                Indeed. You can just complicate the expression to be matched or use a compound condition.

                Here's a stupid example. You want to catch "Stata" but not if "user" is mentioned. So "Stata user" qualifies on the first rule, but is disqualified on the second.

                Code:
                . di strpos("Stata user", "Stata") & !strpos("Stata user", "user")
                0
                The example here uses one function, but the principle carries over to similar functions.
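For the nu-seal question in #7, one way to complicate the expression is a negated character class. A sketch, assuming "alphanumeric" means ASCII letters and digits; the test strings are made up.

```stata
// [^a-zA-Z0-9]* allows any run (possibly empty) of non-alphanumeric characters
foreach s in "nuseal" "nu-seal" "nu  seal" "nu1seal" "nuclear seal" {
    display "`s': " regexm("`s'", "nu[^a-zA-Z0-9]*seal")
}
// the first three display 1; "nu1seal" and "nuclear seal" display 0
```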



                • #9
                  Thanks for this posting. How would I adjust the code if I wanted to catch "Stata" but not if any of the numbers 0-9 were mentioned?



                  • #10
                    Searching for numeric characters is very well documented. Do see

                    FAQ: Regular expressions. K. S. Turner, 10/05.
                    What are regular expressions and how can I use them in Stata?
                    http://www.stata.com/support/faqs/data/regex.html

                    Code:
                     
                    . di strmatch("Stata 14", "*Stata*") & !regexm("Stata 14", "([0-9]+)")
                    0
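The same exclusion can be checked with strpos(), which needs no wildcards for a substring test. A sketch on two made-up strings, one that should pass and one that should be excluded:

```stata
// keep strings that mention Stata but contain no digit
display strpos("Stata user", "Stata") > 0 & !regexm("Stata user", "[0-9]")
display strpos("Stata 14",   "Stata") > 0 & !regexm("Stata 14",   "[0-9]")
// displays 1 for "Stata user" and 0 for "Stata 14"
```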



                    • #11
                      Thanks very much for these postings. I really appreciate the advice and links.



                      • #12
                        Back to my thread as I'm looking for advice again! I want to use regexr to replace the phrase "new/word" with "newword". However, I can't recall how to escape the / in the expression:
                        replace variable=regexr(variable,"new/word", "newword"). Can anyone advise?



                        • #13
                          See #2 again for the advice to play with examples and display.

                          Code:
                          . di regexr("stuff new/word stuff","new/word", "newword")
                          stuff newword stuff


                          The forward slash has no special meaning in regular expressions and can be searched for as a literal character.

                          See #10 again for a link to documentation. Whatever is not a special character ... is not special.
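By contrast, a character that is special does need a backslash. A small sketch with the period, using a made-up string in a do-file:

```stata
// an unescaped . matches any single character, so the first character is replaced
display regexr("v1.5", ".", "_")     // _1.5
// escaped with a backslash, it matches only a literal period
display regexr("v1.5", "\.", "_")    // v1_5
```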

