  • Matching keywords in a string with regexm

    Hello,

    I've been trying to use the regexm() function to match certain keywords within a string variable, and I was wondering if anyone knows how to get an exact match for a word within a sentence. With the code below I can figure out how to match something at the beginning or end of the string, but often the word I'm looking for is in the middle of the string. For certain words it will also pick up the word within another word (e.g., I want to pick up the word STI, but it also picks up other words like STITCH or STILL). The code I've tried, using Stata/SE 13, is below:

    gen STI_extracted = 1 if(regexm(lab_comments1, "(STI)|(SHS)|(ROUTINE)"))

    gen STI_extracted = 1 if(regexm(lab_comments1, "(^STI)|(SHS)|(ROUTINE)"))


    Thanks

  • #2
    Matching a pattern anywhere in the string, including inside a longer word, is exactly what -regexm- is meant to do. I am sure that there are more efficient methods, but here is one way of extracting a specific word from a list:

    Code:
    webuse auto
    * EXTRACT WORD "Buick" from variable "make"
    gen make2 = lower(make)
    * to install, type in Stata's command window: ssc install leftalign
    leftalign make2
    * split make2 into its first five words
    forvalues i = 1(1)5 {
        gen word`i' = word(make2, `i')
    }

    * flag each word position that equals "buick"
    forvalues i = 1(1)5 {
        gen buick`i' = inlist(word`i', "buick")
    }

    * combine the five flags into a single 0/1 indicator
    egen Buick = rowtotal(buick1-buick5)
    replace Buick = 1 if Buick > 1
    drop buick*
    list make if Buick == 1

    Code:
    
    . list make if Buick==1
    
         +---------------+
         | make          |
         |---------------|
      4. | Buick Century |
      5. | Buick Electra |
      6. | Buick LeSabre |
      7. | Buick Opel    |
      8. | Buick Regal   |
         |---------------|
      9. | Buick Riviera |
     10. | Buick Skylark |
         +---------------+
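
    The same word-by-word comparison can probably be written without the hard-coded limit of five words and without the intermediate variables. This is only a rough sketch: it assumes the auto dataset shipped with Stata (loaded here with -sysuse- rather than -webuse-), and the names nwords and maxwords are made up for illustration.

    ```stata
    sysuse auto, clear
    gen make2 = lower(make)

    * find the largest number of words in any observation
    * (nwords and maxwords are illustrative names, not from the post above)
    gen nwords = wordcount(make2)
    summarize nwords, meanonly
    local maxwords = r(max)

    * flag observations in which any whole word equals "buick"
    gen byte Buick = 0
    forvalues i = 1/`maxwords' {
        replace Buick = 1 if word(make2, `i') == "buick"
    }

    list make if Buick == 1
    ```

    Because word() returns an empty string when an observation has fewer than `i' words, short strings simply never match.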



    • #3
      When I'm trying to match whole words, I usually find it simpler to pad the input string with spaces. That way, all I have to look for is some text that starts and ends with a space. If punctuation characters can also delimit words, you can define a character class (a list of characters within square brackets) and use that as the delimiter instead of a single space. For example:

      Code:
      "[,\. ]"
      will match either a comma, a period, or a space. The period is interpreted as a wildcard in regular expressions so it must be escaped with a backslash if you want to match the literal character.

      You can also use a negated character class when matching whole words. For example

      Code:
      "[^A-Z]"
      will match any character that is not an uppercase letter.

      Putting this all together, here's a simple example

      Code:
      clear
      input str50 s
      "STI SHS ROUTINE"
      "IS STILL GOOD"
      "SHS HERE"
      "MORE ROUTINE."
      "SHS, NOW WHAT"
      "SHS."
      "IS SHS? GOOD"
      end
      
      * match either " STI " or " SHS " or " ROUTINE "
      gen match1 = regexm(" " + s + " ", " (STI|SHS|ROUTINE) ")
      
      * words can also be delimited by punctuation characters. The period
      * is a wildcard so it must be escaped with a backslash
      gen match2 = regexm(" " + s + " ", "[,\. ](STI|SHS|ROUTINE)[,\. ]")
      
      * you can also use a negated character class: a leading "^" inside the
      * brackets matches any character that is not in the class.
      gen match3 = regexm(" " + s + " ", "[^A-Z](STI|SHS|ROUTINE)[^A-Z]")
      
      list
      and the results:
      Code:
      . list
      
           +--------------------------------------------+
           |               s   match1   match2   match3 |
           |--------------------------------------------|
        1. | STI SHS ROUTINE        1        1        1 |
        2. |   IS STILL GOOD        0        0        0 |
        3. |        SHS HERE        1        1        1 |
        4. |   MORE ROUTINE.        0        1        1 |
        5. |   SHS, NOW WHAT        0        1        1 |
           |--------------------------------------------|
        6. |            SHS.        0        1        1 |
        7. |    IS SHS? GOOD        0        0        1 |
           +--------------------------------------------+
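
      For readers on a newer release, a word-boundary anchor may be a shorter route to the same result. The sketch below assumes Stata 14 or later, where the Unicode function ustrregexm() is available and understands "\b"; it will not run in Stata 13. The variable name match4 and the extra "STITCH REMOVED" row are added here purely for illustration.

      ```stata
      clear
      input str50 s
      "STI SHS ROUTINE"
      "IS STILL GOOD"
      "IS SHS? GOOD"
      "STITCH REMOVED"
      end

      * "\b" matches the boundary between a word character and a non-word
      * character (or the start/end of the string), so no padding is needed.
      * Assumes Stata 14+; ustrregexm() does not exist in Stata 13.
      gen match4 = ustrregexm(s, "\b(STI|SHS|ROUTINE)\b")

      list
      ```

      With this pattern, STILL and STITCH should not match because STI is followed by another letter, while "SHS?" should match because the question mark is not a word character.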



      • #4
        Thank you! Both code examples were very helpful.
