

  • Unicode regular expressions: a cautionary tale

    While updating a project to use Unicode regular expressions I managed to crash Stata. Here is a toy example that shows what I learned. Suppose you are trying to detect numbers in braces. The following one-liner works fine and prints 12:

    mata: regexm("{12}", "{([0-9]+)}") ? regexs(1) : ""
    Calling the Unicode versions with exactly the same arguments crashes Stata 14.2 on Windows 10, Mac OS X, and Red Hat Linux.

    mata: ustrregexm("{12}", "{([0-9]+)}") ? ustrregexs(1) : ""
    Can you spot the problem? There are actually three things going on here.

    1. The Unicode ustrregexm() returns a negative number if an error occurs. Here it returns -1, which the ?: test treats as true because it is nonzero. We need to check for > 0.

    2. The error occurs because ustrregexm() requires escaping literal braces, whereas regexm() works with or without escaping. (I think this is because we can now use patterns such as "[0-9]{3}" to match exactly 3 digits, so an unescaped brace is read as the start of a repetition count.)

    3. The function ustrregexs(n) will crash if called before a valid match or with an invalid group number, whereas regexs() will print an error message. (regexs() can also be called with no arguments to return all groups.)

    Taking care of 1 and 2 avoids 3. The corrected one-liner is

    mata: ustrregexm("{12}", "\{([0-9]+)\}") > 0 ? ustrregexs(1) : ""
    It would be nice if ustrregexs() were updated to check its argument as regexs() does. As things stand, all it takes to crash a fresh Stata session is mata: ustrregexs(0).
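
    Until that happens, one way to stay safe is to wrap the two calls so that ustrregexs() is reached only after ustrregexm() reports a match. The following is a minimal sketch rather than a tested fix: the name safe_group() is my own, and it assumes the caller passes a group number that exists in the pattern, since it does nothing to validate n.

    ```stata
    mata:
    // Hypothetical helper: return capture group n of pattern in s, or ""
    // when there is no match or the pattern is invalid.
    string scalar safe_group(string scalar s, string scalar pattern, real scalar n)
    {
        // ustrregexm() returns a positive number on a match, 0 on no match,
        // and a negative number on error, so test for > 0 explicitly
        if (ustrregexm(s, pattern) > 0) {
            return(ustrregexs(n))
        }
        return("")
    }
    end

    // Literal braces escaped, as the Unicode functions require
    mata: safe_group("{12}", "\{([0-9]+)\}", 1)
    ```

    With the unescaped pattern "{([0-9]+)}" the guard fails and the helper falls through to the empty string instead of calling ustrregexs().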


  • #2
    Thanks for the detailed report of the bug. This will be fixed in a future update.


    • #3
      Terrific, thanks!


      • #4
        I don't know if this is relevant, but the same applies to the non-Mata Unicode regex functions; Stata also crashes with this one-liner:
        display ustrregexs(0)


        • #5
          It is the same issue. In fact, the Stata and Mata functions share the same code at the core level. The Stata function will be fixed as well.

