  • How to match the entire word instead of letters

    Hi all, I am trying to match strings in a dataset using regexm(), e.g.:

    regexm("12345", "([0-9]){5}") = 1
    regexm("Hong Kai Cheng", "Chen") = 1
    regexm("Hong Kai Cheng", "She") = 0
    regexm("Hong Kai Cheng", "Cheng") = 1


    But what if I only want to match the entire word? E.g., I do not wish to match "Chen" in "Hong Kai Cheng"; I only want "Cheng" to match in "Hong Kai Cheng".
    What code can I use?

  • #2
    See the very recent thread https://www.statalist.org/forums/for...-in-local-list for precisely this problem.

    One solution is epitomized by looking for " Cheng " within " " + strvar + " " -- where strvar is a string variable.
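
    As a minimal sketch of that idea (the variable names name and whole and the three example strings are made up for illustration; strpos() returns the position of a substring, or 0 if it is absent):

    ```stata
    clear
    input str20 name
    "Hong Kai Cheng"
    "Hong Kai Chen"
    "Chengdu Hong"
    end

    * pad the variable with a space on each side, then look for the
    * target word with a space on each side
    generate byte whole = strpos(" " + name + " ", " Cheng ") > 0

    * whole is 1 for "Hong Kai Cheng" and 0 for the other two rows
    list, clean
    ```

    Note that this treats only spaces as word delimiters, so a name followed directly by a comma or full stop would be missed unless the punctuation is removed first.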


    • #3
      If you are interested in using regular expression functions to solve this problem, the first step is to replace regexm() with ustrregexm().

      The Unicode regular expression functions introduced in Stata 14 have a much more powerful definition of regular expressions than the non-Unicode functions. In the Statalist post linked here we are told that Stata's Unicode regular expression parser is the ICU regular expression engine documented here. A comprehensive discussion of regular expressions can be found here.

      A good introduction to Stata's Unicode regular expression functions is given by Asjad Naqvi at The Stata Guide. Hua Peng (StataCorp) provides additional examples of advanced techniques in his GitHub blog.

      The example below demonstrates the use of the "\b" metacharacter to match "word boundaries": positions between a word character and a non-word character such as a space or punctuation mark, as well as the beginning and end of a string.

      Code:
      . * old regular expression function
      . display regexm("Hong Kai Cheng", "Chen")
      1
      
      . display regexm("Hong Kai Cheng", "Cheng")
      1
      
      . * Unicode regular expression function
      . display ustrregexm("Hong Kai Cheng", "\bChen\b")
      0
      
      . display ustrregexm("Hong Kai Cheng", "\bCheng\b")
      1


      • #4
        Also see post #3 in the following thread for a general method that is robust to punctuation characters that delimit words: https://www.statalist.org/forums/for...g-observations


        • #5
          De gustibus non disputandum est, but I want to be clear that the word-break technique from post #2, when combined with doing all matching in lower case to avoid capitalization issues, produces the same results as the code referenced in post #3 at https://www.statalist.org/forums/for...g-observations. It will not work with regexm(), though; it requires ustrregexm().

          Code:
          . input strL tagline
          
                 tagline
            1. "Cable-free live TV is here. You Tube TV"
            2. "Join a better network! Because better matters. Verizon"
            3. "COLONEL QUALITY GUARANTEED. KFC"
            4. "Goodyear, more driven."
            5. "15 minutes could save you 15% or more on car insurance.GEICO"
            6. "All the News That's Fit to Print. NYT"
            7. "America Runs on Dunkin'. Dunkin' Donuts"
            8. "Imagination at Work. GE"
            9. "CAN YOU EAT THREE SHREDDED WHEAT? – Nabisco"
           10. "one more quality: a broader range of punctuation"
           11. end
          
          . gen match = ustrregexm(lower(tagline), "\b(tv|network|quality|goodyear|dunkin|wheat)\b")
          
          . l, clean
          
                                                                      tagline   match  
            1.                        Cable-free live TV is here. You Tube TV       1  
            2.         Join a better network! Because better matters. Verizon       1  
            3.                                COLONEL QUALITY GUARANTEED. KFC       1  
            4.                                         Goodyear, more driven.       1  
            5.   15 minutes could save you 15% or more on car insurance.GEICO       0  
            6.                          All the News That's Fit to Print. NYT       0  
            7.                        America Runs on Dunkin'. Dunkin' Donuts       1  
            8.                                        Imagination at Work. GE       0  
            9.                    CAN YOU EAT THREE SHREDDED WHEAT? – Nabisco       1  
           10.               one more quality: a broader range of punctuation       1  
          
          .


          • #6
            Originally posted by William Lisowski (post #3)
            Thanks for this. Would it work if I have a string variable `LastName' that contains all the last names? For example:

            Code:
            . * Unicode regular expression function
            . display ustrregexm("Hong Kai Cheng", "\b`LastName'\b")
            0
            
            . display ustrregexm("Hong Kai Cheng", "\b`LastName'\b")
            1
            Also, I have sometimes seen people use "^" and "$" to denote the start and end of a string, e.g. in post #5 of https://www.statalist.org/forums/for...-in-local-list
            However, when I try using "^" and "$", I do not get the same result as with "\b".



            • #7
              Post #6 seems to have been addressed by the poster in the new topic at

              https://www.statalist.org/forums/for...on-command-box


              • #8
                Thank you William Lisowski for linking the guide!


                vicky chann, you are on the right track, but you really need to understand (a) the logic of regex and (b) what you really want to extract. So don't try random codes! ^ anchors the match to the start of the string, $ anchors it to the end of the string, and \b marks a word boundary. These have very specific uses.
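
                To make those three conditions concrete, here is a small sketch using the string from the original question. The patterns are illustrative only and are not taken from the linked posts; each display prints 1 for a match and 0 for no match.

                ```stata
                * start and end anchors together: the ENTIRE string must be "Cheng" (displays 0)
                display ustrregexm("Hong Kai Cheng", "^Cheng$")

                * end anchor only: the string must end with "Cheng" (displays 1)
                display ustrregexm("Hong Kai Cheng", "Cheng$")

                * start anchor only: the string must start with "Hong" (displays 1)
                display ustrregexm("Hong Kai Cheng", "^Hong")

                * word boundaries: "Kai" appears as a whole word mid-string (displays 1)
                display ustrregexm("Hong Kai Cheng", "\bKai\b")

                * word boundaries: "Chen" is only part of the word "Cheng" (displays 0)
                display ustrregexm("Hong Kai Cheng", "\bChen\b")
                ```

                This is why "^Cheng$" and "\bCheng\b" give different answers on a full name: the anchors refer to the whole string, while \b refers to the edges of a single word inside it.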

                Here is some sample code for your example:

                Code:
                clear
                set obs 1
                
                gen name = "Hong Kai Cheng"
                
                * exact match on the literal text "Cheng"
                gen lastname1 = ustrregexs(0) if ustrregexm(name, "Cheng")
                * "Chen" followed by one or more word characters
                gen lastname2 = ustrregexs(0) if ustrregexm(name, "Chen\w+")
                Also note that ustrregexs(0) returns the text that was matched, whereas ustrregexm() on its own only returns a 1 or a 0 (a boolean match).

                Here lastname1 will return Cheng because that is an exact match. If this is exactly what you are looking for, then only use exact match conditions.
                And lastname2 will also return Cheng, through a fuzzy match, because we are saying: find "Chen" followed by one or more word characters (\w+). This ONLY works if you know for sure that the last name can ONLY be Cheng. Otherwise you can get anything where the first four letters are "Chen" and more characters follow.
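
                For the follow-up question in #6 about matching against a variable of last names, one possible sketch is to build the pattern for each observation by string concatenation. The variable names fullname, lastname and whole and the example rows are hypothetical, and this assumes the last names contain no regex metacharacters such as "." or "(".

                ```stata
                clear
                input str20 fullname str10 lastname
                "Hong Kai Cheng" "Cheng"
                "Hong Kai Cheng" "Chen"
                "Kai Chen"       "Chen"
                end

                * wrap each observation's lastname in word boundaries and match it
                * against fullname in the same observation
                generate byte whole = ustrregexm(fullname, "\b" + lastname + "\b")

                * whole is 1, 0, 1: "Chen" is not a whole word in "Hong Kai Cheng"
                list, clean
                ```

                Because the pattern is an ordinary string expression, the same construction should also work with a local macro holding a single name, as in the "\b`LastName'\b" form asked about in #6.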

                It is important to know that regular expressions need to be built up depending on the type of match you want to do. If you want to learn more, then do read the Regex guide. I also have a Stata regex cheat sheet that you can download and print for quick reference.

                Good luck!
                Asjad
                Last edited by Asjad Naqvi; 10 Sep 2022, 07:30.
