Matching keywords in a string with regexm

Christa Smolarchuk

Join Date: Jul 2015

Posts: 6
#1

01 Mar 2016, 08:05

Hello,

I've been trying to use the regexm() function to match certain keywords within a string variable. Does anyone know how to get an exact match for a word within a sentence? With the code below I can figure out how to match something at the beginning or end of the string, but often the word I'm looking for is in the middle of the string. For certain words it also picks up the word inside another word (e.g. I want to pick up the word STI, but it also picks up other words like STITCH or STILL). The code I've tried, using Stata 13/SE, is below:

gen STI_extracted = 1 if(regexm(lab_comments1, "(STI)|(SHS)|(ROUTINE)"))

gen STI_extracted = 1 if(regexm(lab_comments1, "(^STI)|(SHS)|(ROUTINE)"))

Thanks

Andrew Musau

Join Date: Oct 2014
Posts: 9947

01 Mar 2016, 09:00

Matching the pattern anywhere in the string, including inside a longer word, is exactly what -regexm- is meant to do. I am sure that there are more efficient methods, but here is one way of extracting a specific word from a list:

Code:

webuse auto
* extract the word "Buick" from the variable make
gen make2 = lower(make)
* to install, type in Stata's command window: ssc install leftalign
leftalign make2
* split make2 into (up to) its first five words
forvalues i = 1/5 {
    gen word`i' = word(make2, `i')
}

* flag each word position that is exactly "buick"
forvalues i = 1/5 {
    gen buick`i' = inlist(word`i', "buick")
}

* combine the five flags into a single 0/1 indicator
egen Buick = rowtotal(buick1-buick5)
replace Buick = 1 if Buick > 1
drop buick*
list make if Buick == 1

Code:


. list make if Buick==1

     +---------------+
     | make          |
     |---------------|
  4. | Buick Century |
  5. | Buick Electra |
  6. | Buick LeSabre |
  7. | Buick Opel    |
  8. | Buick Regal   |
     |---------------|
  9. | Buick Riviera |
 10. | Buick Skylark |
     +---------------+
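
The loop above assumes no value has more than five words. As a rough sketch of the same word-by-word idea without that assumption (the names nwords, maxwords, and Buick2 are made up for illustration and are not part of the original answer), the number of words can be taken from wordcount():

```stata
webuse auto, clear
gen make2 = lower(make)

* largest number of words in any value of make2
gen nwords = wordcount(make2)
summarize nwords, meanonly
local maxwords = r(max)

* flag observations in which any single word is exactly "buick"
gen byte Buick2 = 0
forvalues i = 1/`maxwords' {
    replace Buick2 = 1 if word(make2, `i') == "buick"
}
drop nwords

list make if Buick2 == 1
```

Like the original, this splits on spaces only, so a word with punctuation attached (for example "buick,") would not be flagged.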


Robert Picard

Join Date: Mar 2014
Posts: 1536

01 Mar 2016, 09:39

When I'm trying to match whole words, I usually find it simpler to pad the input string with spaces. That way, all I have to look for is some text that starts and ends with a space. If punctuation characters can also delimit words, you can use a character class (a list of characters within square brackets) as the delimiter. For example:

Code:

"[,\. ]"

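will match either a comma, a period, or a space. Outside a character class the period is interpreted as a wildcard in regular expressions, so it must be escaped with a backslash if you want to match the literal character. Inside square brackets the period is already literal, so the backslash is not strictly needed here.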

You can also use a negated character class when matching whole words. For example

Code:

"[^A-Z]"

will match any character that is not an uppercase letter.

Putting this all together, here's a simple example

Code:

clear
input str50 s
"STI SHS ROUTINE"
"IS STILL GOOD"
"SHS HERE"
"MORE ROUTINE."
"SHS, NOW WHAT"
"SHS."
"IS SHS? GOOD"
end

* match either " STI " or " SHS " or " ROUTINE "
gen match1 = regexm(" " + s + " ", " (STI|SHS|ROUTINE) ")

* words can also be delimited by punctuation characters. Outside a character
* class the period is a wildcard and must be escaped with a backslash
gen match2 = regexm(" " + s + " ", "[,\. ](STI|SHS|ROUTINE)[,\. ]")

* you can also use negated character classes (a leading "^" inside the
* brackets matches any character that is not in the class).
gen match3 = regexm(" " + s + " ", "[^A-Z](STI|SHS|ROUTINE)[^A-Z]")

list

and the results

Code:

. list

     +--------------------------------------------+
     |               s   match1   match2   match3 |
     |--------------------------------------------|
  1. | STI SHS ROUTINE        1        1        1 |
  2. |   IS STILL GOOD        0        0        0 |
  3. |        SHS HERE        1        1        1 |
  4. |   MORE ROUTINE.        0        1        1 |
  5. |   SHS, NOW WHAT        0        1        1 |
     |--------------------------------------------|
  6. |            SHS.        0        1        1 |
  7. |    IS SHS? GOOD        0        0        1 |
     +--------------------------------------------+
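
For readers on a newer release than the Stata 13 used in this thread, a possible alternative is the Unicode regular expression function ustrregexm(), which was added in Stata 14 and supports the `\b` word-boundary anchor. The sketch below is an addition rather than part of the answer above. It assumes Stata 14 or later, and the variable name match4 is made up for illustration.

```stata
* assumes Stata 14 or later; ustrregexm() is not available in Stata 13
clear
input str50 s
"STI SHS ROUTINE"
"IS STILL GOOD"
"SHS HERE"
"MORE ROUTINE."
"SHS, NOW WHAT"
"SHS."
"IS SHS? GOOD"
end

* padded, negated-character-class approach, kept for comparison
gen match3 = regexm(" " + s + " ", "[^A-Z](STI|SHS|ROUTINE)[^A-Z]")

* \b matches at a word boundary, so the string needs no padding
gen match4 = ustrregexm(s, "\b(STI|SHS|ROUTINE)\b")

list
```

On these seven rows match4 should agree with match3. The two can differ on other data: `\b` counts digits and underscores as part of a word, so a value such as "STI2" would be matched by the `[^A-Z]` pattern but not by the `\b` pattern.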


Christa Smolarchuk

Join Date: Jul 2015

Posts: 6
#4

01 Mar 2016, 10:33

Thank you! Both pieces of code were very helpful.