Is there an obvious way to suppress the per-observation errors which arise from a bad regexm()?

Malcolm Wardlaw

Join Date: Apr 2014

Posts: 46
#1

18 Nov 2019, 10:31

Whenever I perform a regex match with an ill-formed pattern via, say,

Code:

gen var2 = regexm(var1,"(stuff")

the log prints a bad-regex error for every single observation. For large datasets this fills the console and the entire log with useless information: the console becomes unusable because the output exceeds the scroll buffer, and the log becomes painfully large and difficult to parse.

Here is a simple example:

Code:

clear
sysuse auto
gen ford = regexm(make, "(Ford|Linc|Merc")

This will generate a stream of "regexp: unterminated ()" errors in red. In nearly every case this is unnecessary; a single error would be sufficient to alert the user to the ill-formed regex.

Is there an obvious way to suppress this error? The regexm() function appears to be built in, so I don't think it can be edited and then shadowed by a personal version.

I also don't quite know how one would put a wrapper around the function to suppress the errors. One could capture the generate command, but the errors printed by regexm() don't appear to set _rc, so I have no way of knowing whether it failed. It would also be nice to have a wrapper for the function itself, so that one doesn't need to write a new generate command.

William Lisowski

Join Date: Dec 2014
Posts: 10150

18 Nov 2019, 11:31

Your best bet is to switch to the Unicode regular expression functions.

Code:

. clear

. sysuse auto
(1978 Automobile Data)

. gen ford = ustrregexm(make, "(Ford|Linc|Merc")

. list make ford in 21/40, clean

       make                ford  
 21.   Dodge Diplomat        -1  
 22.   Dodge Magnum          -1  
 23.   Dodge St. Regis       -1  
 24.   Ford Fiesta           -1  
 25.   Ford Mustang          -1  
 26.   Linc. Continental     -1  
 27.   Linc. Mark V          -1  
 28.   Linc. Versailles      -1  
 29.   Merc. Bobcat          -1  
 30.   Merc. Cougar          -1  
 31.   Merc. Marquis         -1  
 32.   Merc. Monarch         -1  
 33.   Merc. XR-7            -1  
 34.   Merc. Zephyr          -1  
 35.   Olds 98               -1  
 36.   Olds Cutl Supr        -1  
 37.   Olds Cutlass          -1  
 38.   Olds Delta 88         -1  
 39.   Olds Omega            -1  
 40.   Olds Starfire         -1  

. replace ford = ustrregexm(make, "(Ford|Linc|Merc)")
(74 real changes made)

. list make ford in 21/40, clean

       make                ford  
 21.   Dodge Diplomat         0  
 22.   Dodge Magnum           0  
 23.   Dodge St. Regis        0  
 24.   Ford Fiesta            1  
 25.   Ford Mustang           1  
 26.   Linc. Continental      1  
 27.   Linc. Mark V           1  
 28.   Linc. Versailles       1  
 29.   Merc. Bobcat           1  
 30.   Merc. Cougar           1  
 31.   Merc. Marquis          1  
 32.   Merc. Monarch          1  
 33.   Merc. XR-7             1  
 34.   Merc. Zephyr           1  
 35.   Olds 98                0  
 36.   Olds Cutl Supr         0  
 37.   Olds Cutlass           0  
 38.   Olds Delta 88          0  
 39.   Olds Omega             0  
 40.   Olds Starfire          0

The output of help ustrregexm() documents all four of the Unicode regular expression functions. Since ASCII characters 1-127 are a proper subset of Unicode, these functions work fine with such strings. For strings using extended ASCII characters 128-255, not so much: the string needs to be converted to Unicode first.

The real benefit of the Unicode regular expression functions is their much more powerful definition of regular expressions. To the best of my knowledge, only in the Statalist post linked here is it documented that Stata's new regular expression parser is the ICU regular expression engine documented at http://userguide.icu-project.org/strings/regexp.
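
Returning to the original question of being alerted once rather than once per observation: the first listing above shows ustrregexm() returning -1 for every observation when the pattern is ill-formed, so one possible check is to test the new variable for negative values after generating it. The following is only a minimal sketch under that assumption; the error message text and the choice of return code 198 are illustrative and not anything Stata prescribes.

```stata
clear
sysuse auto

* Ill-formed pattern (missing closing parenthesis), as in the question
gen ford = ustrregexm(make, "(Ford|Linc|Merc")

* In the listing above, a bad pattern yields -1 in every observation,
* so a single assert is enough to detect it
capture assert ford >= 0
if _rc {
    display as error "ustrregexm() returned a negative value; check the regular expression"
    exit 198    // illustrative return code; stops the do-file here
}
```

Run from a do-file, this should print a single message and stop, rather than writing one error line per observation to the log.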
