While updating a project to use Unicode regular expressions, I managed to crash Stata. Here is a toy example that shows what I learned. Suppose you are trying to detect numbers in braces. The following one-liner works fine and prints 12:
Code:
mata: regexm("{12}", "{([0-9]+)}") ? regexs(1) : ""
Calling the Unicode versions with exactly the same arguments crashes Stata 14.2 on Windows 10, Mac OS X, and Red Hat Linux:
Code:
mata: ustrregexm("{12}", "{([0-9]+)}") ? ustrregexs(1) : ""
Can you spot the problem? There are actually three things going on here.
1. The Unicode ustrregexm() returns a negative number if an error occurs. Here it returns -1, which Mata treats as true because it is nonzero. We need to check for > 0.
2. The error occurs because ustrregexm() requires escaping literal braces, whereas regexm() works with or without escaping. (I think this is because braces are now quantifier syntax: we can use patterns such as "[0-9]{3}" to match exactly 3 digits, or "[0-9]{3,}" to match 3 or more.)
3. The function ustrregexs(n) will crash if called before a valid match or with an invalid group number, whereas regexs() will print an error message. (regexs() can also be called with no arguments to return all groups.)
Taking care of 1 and 2 avoids 3. The corrected one-liner is
Code:
mata: ustrregexm("{12}", "\{([0-9]+)\}") > 0 ? ustrregexs(1) : ""
It would be nice if ustrregexs() were updated to check its argument as regexs() does. As things stand, all it takes to crash a fresh Stata session is mata: ustrregexs(0).
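The > 0 guard can also be wrapped in a small Mata function so that it is not forgotten at each call site. This is only a minimal sketch: ustr_group() is a hypothetical helper name, not something Stata provides, and it assumes that returning an empty string is an acceptable way to signal "no match or bad pattern". It does not protect against an out-of-range group number, which, per point 3, can still crash ustrregexs().

```stata
mata:
// Hypothetical helper, not a built-in: return capture group n of the
// match of pattern re in s, or "" if there is no match or the pattern
// is rejected.
string scalar ustr_group(string scalar s, string scalar re, real scalar n)
{
    // ustrregexm() returns a positive value on a match, 0 on no match,
    // and a negative value on error, so test > 0 rather than truthiness
    if (ustrregexm(s, re) > 0) return(ustrregexs(n))
    return("")
}

ustr_group("{12}", "\{([0-9]+)\}", 1)   // escaped braces: pattern is accepted
ustr_group("{12}", "{([0-9]+)}", 1)     // unescaped braces: error, so ""
end
```

Because ustrregexs() is reached only after a successful match, the second call takes the fallback branch instead of touching ustrregexs() at all.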
Germán