How to match the entire word instead of letters

vicky chann

Join Date: Aug 2022

Posts: 23
#1

09 Sep 2022, 00:14

Hi all, I am currently trying to match strings in a dataset, so I was using regexm(), e.g.:

regexm("12345", "([0-9]){5}") = 1
regexm("Hong Kai Cheng", "Chen") = 1
regexm("Hong Kai Cheng", "She") = 0
regexm("Hong Kai Cheng", "Cheng") = 1

But what if I only want to match the entire word? For example, I do not wish to match "Chen" in "Hong Kai Cheng"; I only want to match "Cheng" in "Hong Kai Cheng".
What code can I use?
Nick Cox

Join Date: Mar 2014

Posts: 35433
#2

09 Sep 2022, 00:52

See the very recent thread https://www.statalist.org/forums/for...-in-local-list for precisely this problem.

One solution is epitomized by looking for " Cheng " within " " + strvar + " " -- where strvar is a string variable.
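
A minimal sketch of the padding idea above, assuming a hypothetical string variable named strvar and made-up example data in which words are separated by single spaces. strpos() returns the position of the second string within the first, or 0 if it is not found:

```stata
* hypothetical example data; strvar is an assumed variable name
clear
input str20 strvar
"Hong Kai Cheng"
"Hong Kai Chen"
"Cheng Hong Kai"
end

* pad both sides with a space so that the first and last words are delimited too
generate byte has_cheng = strpos(" " + strvar + " ", " Cheng ") > 0
generate byte has_chen  = strpos(" " + strvar + " ", " Chen ") > 0
list, clean
```

Note that this sketch treats only spaces as delimiters, so a name followed directly by a comma or period would not be found.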
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

09 Sep 2022, 09:43

If you are interested in using regular expression functions to solve this problem, the first step is to replace regexm() with ustrregexm().

The Unicode regular expression functions introduced in Stata 14 have a much more powerful definition of regular expressions than the non-Unicode functions. In the Statalist post linked here we are told that Stata's Unicode regular expression parser is the ICU regular expression engine documented here. A comprehensive discussion of regular expressions can be found here.

A good introduction to Stata's Unicode regular expression functions is given by Asjad Naqvi at The Stata Guide. Hua Peng (StataCorp) provides additional examples of advanced techniques in his github blog.

The example below demonstrates the use of the "\b" metacharacter to match "word boundaries", which include spaces, punctuation, and the beginning and end of a string.

Code:

. * old regular expression function
. display regexm("Hong Kai Cheng", "Chen")
1
. display regexm("Hong Kai Cheng", "Cheng")
1
. * Unicode regular expression function
. display ustrregexm("Hong Kai Cheng", "\bChen\b")
0
. display ustrregexm("Hong Kai Cheng", "\bCheng\b")
1
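
To illustrate the remark that word boundaries include punctuation and the ends of the string, here is a short additional sketch. The test strings are made up for illustration and are not from the original question:

```stata
* \b matches at the start of the string and next to punctuation
display ustrregexm("Cheng, Hong Kai", "\bCheng\b")
display ustrregexm("Hong Kai (Cheng)", "\bCheng\b")
* "Chen" is followed by "g", a word character, so there is no boundary after it
display ustrregexm("Hong Kai Cheng", "\bChen\b")
```

These three commands should display 1, 1, and 0 respectively.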
Andrew Musau

Join Date: Oct 2014

Posts: 10084
#4

09 Sep 2022, 10:42

Also see #3 for a general method that is robust to punctuation characters that delimit words: https://www.statalist.org/forums/for...g-observations

William Lisowski

Join Date: Dec 2014

Posts: 10150
#5

09 Sep 2022, 11:22

De gustibus non disputandum est, but I want to be clear that the word-boundary technique from post #3, when combined with doing all matching in lower case to avoid capitalization issues, produces the same results as the code referenced in post #4 at https://www.statalist.org/forums/for...g-observations. It will not work, though, with regexm() in place of ustrregexm().

Code:

. input strL tagline

       tagline
  1. "Cable-free live TV is here. You Tube TV"
  2. "Join a better network! Because better matters. Verizon"
  3. "COLONEL QUALITY GUARANTEED. KFC"
  4. "Goodyear, more driven."
  5. "15 minutes could save you 15% or more on car insurance.GEICO"
  6. "All the News That's Fit to Print. NYT"
  7. "America Runs on Dunkin'. Dunkin' Donuts"
  8. "Imagination at Work. GE"
  9. "CAN YOU EAT THREE SHREDDED WHEAT? – Nabisco"
 10. "one more quality: a broader range of punctuation"
 11. end

. gen match = ustrregexm(lower(tagline), "\b(tv|network|quality|goodyear|dunkin|wheat)\b")

. l, clean

                                                            tagline   match  
  1.                        Cable-free live TV is here. You Tube TV       1  
  2.         Join a better network! Because better matters. Verizon       1  
  3.                                COLONEL QUALITY GUARANTEED. KFC       1  
  4.                                         Goodyear, more driven.       1  
  5.   15 minutes could save you 15% or more on car insurance.GEICO       0  
  6.                          All the News That's Fit to Print. NYT       0  
  7.                        America Runs on Dunkin'. Dunkin' Donuts       1  
  8.                                        Imagination at Work. GE       0  
  9.                    CAN YOU EAT THREE SHREDDED WHEAT? – Nabisco       1  
 10.               one more quality: a broader range of punctuation       1  

.

vicky chann

Join Date: Aug 2022

Posts: 23
#6

10 Sep 2022, 02:28

Thanks for the reply in #3. Would it work if I have a string variable `LastName' that contains all the last names? For example:

Code:

. * Unicode regular expression function
. display ustrregexm("Hong Kai Cheng", "\b`LastName'\b")
0
. display ustrregexm("Hong Kai Cheng", "\b`LastName'\b")
1

I also saw that people sometimes use "^" and "$" to denote the start and the end of a string, as in post #5 of https://www.statalist.org/forums/for...-in-local-list
However, when I try using "^" and "$", I do not get the same result as with "\b".
William Lisowski

Join Date: Dec 2014

Posts: 10150
#7

10 Sep 2022, 06:38

Post #6 seems to have been addressed by the poster in the new topic at

https://www.statalist.org/forums/for...on-command-box
Asjad Naqvi

Join Date: Oct 2014

Posts: 91
#8

10 Sep 2022, 07:22

Thank you William Lisowski for linking the guide!

vicky chann you are on the right track but you really need to understand (a) the logic of regex, and (b) what you really want to extract. So don't try random codes! ^ means a start condition, $ means an end condition, \b is a boundary condition. These have very specific uses.
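
The difference between the three conditions can be seen on the string from the original question. This is only an illustrative sketch; the patterns are examples chosen here:

```stata
* ^ and $ together anchor the pattern to the whole string:
* "is the entire string exactly Cheng?"
display ustrregexm("Hong Kai Cheng", "^Cheng$")
display ustrregexm("Cheng", "^Cheng$")
* $ alone anchors only the end: "does the string end with Cheng?"
display ustrregexm("Hong Kai Cheng", "Cheng$")
* \b only requires word boundaries around Cheng, anywhere in the string
display ustrregexm("Hong Kai Cheng", "\bCheng\b")
```

These should display 0, 1, 1, and 1, which is why "^Cheng$" and "\bCheng\b" give different answers for a full name.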

Here is a sample code for your example:

Code:

clear
set obs 1
gen name = "Hong Kai Cheng"
gen lastname1 = ustrregexs(0) if ustrregexm("Hong Kai Cheng", "Cheng")
gen lastname2 = ustrregexs(0) if ustrregexm("Hong Kai Cheng", "Chen\w+")

Also note that ustrregexs(0) returns what has been matched. Otherwise if you just use ustrregexm(), it will return a 1 or a 0 (a boolean match).

Here lastname1 will return Cheng because that is an exact match. If this is exactly what you are looking for, then only use exact match conditions.
And lastname2 will also return Cheng through a fuzzy match because we are saying find "Chen" followed by any set of letters. This ONLY works if you know for sure that the last name can ONLY be Cheng. Otherwise you can get anything where the first four letters are "Chen".
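
If the word to look for is stored in a variable rather than typed as a literal, the pattern can be assembled by string concatenation. The following is a rough sketch under stated assumptions: name and LastName are hypothetical string variables with made-up values (post #6 writes `LastName' as a local macro; here it is assumed to be a variable), and the last names are assumed to contain no regex metacharacters:

```stata
* hypothetical example data
clear
input str20 name str10 LastName
"Hong Kai Cheng" "Cheng"
"Hong Kai Cheng" "Chen"
"Chen Hong Kai"  "Chen"
end

* build the pattern observation by observation: \b + value of LastName + \b
generate byte whole_word = ustrregexm(name, "\b" + LastName + "\b")
list, clean
```

With this data, whole_word should be 1, 0, and 1: "Chen" is not matched inside "Cheng", but it is matched as a word of its own.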

It is important to know that regular expressions need to be built up depending on the type of match you want to do. If you want to learn more, then do read the Regex guide. I also have a Stata regex cheatsheet that you can download and print for quick reference.

Good luck!
Asjad

Last edited by Asjad Naqvi; 10 Sep 2022, 07:30.