Just to clarify, as I saw there might be some confusion. Before Stata 14, Stata had only one set of regular expression functions, regexm(), regexr(), and regexs(), which use an implementation of Henry Spencer's NFA algorithm, which in turn is nearly a subset of the POSIX standard. In Stata 14, we added four Unicode regular expression functions: ustrregexm(), ustrregexrf(), ustrregexra(), and ustrregexs(). These four new functions use the ICU regular expression engine. If you are interested in a comparison of different regular expression engines, see the following wiki page: https://en.wikipedia.org/wiki/Compar...ession_engines
The obvious question is why we maintain two different sets of functions, especially given that the ICU engine is far superior. The short answer is that the two sets of functions are essentially incompatible. And I suspect Bill Buchanan's jregex will suffer from the same issue. Let me explain.
The regex*() functions treat a string as a byte stream and are encoding neutral, i.e., they do not assume any encoding of the string; they just deal with bytes. On the other hand, the ICU-based ustrregex*() functions assume the string is UTF-8 encoded. This is because the ICU engine only works with UTF-16 encoded strings, hence the original string must be converted before it is passed to ICU. As with any conversion of a string to a particular encoding, you have to assume a source encoding. Since Stata 14 uses UTF-8 encoding, UTF-8 is assumed. The side effect is that if the assumption is wrong, for example, your source string is encoded in Latin-1, then the new ustrregex*() functions will not work, because the conversion will lose information from the original string. Another situation is that your original "string" is really a byte stream with no text meaning at all, for example, the bytes of an image file; the new ustrregex*() functions will not work there either, since the conversion will almost surely destroy the string.
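To make the information loss concrete, here is a minimal sketch in Python rather than Stata, purely as an illustration of the "assume UTF-8, then convert" step. The helper names convert_assuming_utf8 and survives_round_trip are made up for this example, and replacing invalid bytes with U+FFFD is an assumption; how Stata or ICU actually treats invalid bytes is not covered here.

```python
# Illustration only: Python stands in for the "assume UTF-8, then convert" step.
# Assumption: invalid byte sequences are replaced with U+FFFD. This is not a
# statement about what Stata or ICU does internally.

def convert_assuming_utf8(raw: bytes) -> str:
    """Decode raw bytes as UTF-8, replacing anything that is not valid UTF-8."""
    return raw.decode("utf-8", errors="replace")

def survives_round_trip(raw: bytes) -> bool:
    """True if decoding as UTF-8 and encoding back reproduces the same bytes."""
    return convert_assuming_utf8(raw).encode("utf-8") == raw

utf8_bytes = "caf\u00e9".encode("utf-8")      # e-acute stored as two bytes
latin1_bytes = "caf\u00e9".encode("latin-1")  # e-acute stored as one byte, 0xE9
png_header = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])  # binary, not text

for label, raw in [("UTF-8", utf8_bytes),
                   ("Latin-1", latin1_bytes),
                   ("PNG header", png_header)]:
    # Only the genuinely UTF-8 input comes back unchanged.
    print(label, survives_round_trip(raw))
```

Only the first input survives; for the other two, the original bytes cannot be recovered once the conversion has been made, which is why a text-oriented engine is the wrong tool for them.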
Hence the two sets of functions basically deal with different cases: regex*() are for byte streams, ustrregex*() are for text. If your data are all ASCII (English a-z, A-Z, 0-9, and punctuation), both will work, but regex*() are faster, while ustrregex*() support standard regular expression syntax better.
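The byte-versus-text distinction can be sketched outside Stata as well. The following uses Python's re module, which accepts both bytes and text patterns; it is only an analogy for the two families of functions, not a description of either Stata engine.

```python
import re

# ASCII data: matching on bytes and matching on text give the same answer.
ascii_text = "abc123"
byte_hit = re.search(rb"[0-9]+", ascii_text.encode("ascii"))
text_hit = re.search(r"[0-9]+", ascii_text)
print(byte_hit.group(0), text_hit.group(0))

# Non-ASCII data: "." means one byte in a bytes pattern, one character in a
# text pattern, so the same pattern behaves differently.
word = "caf\u00e9"                 # four characters
utf8_bytes = word.encode("utf-8")  # five bytes, e-acute takes two

text_full = re.fullmatch(r"caf.", word)              # matches: "." covers the e-acute
byte_full = re.fullmatch(rb"caf.", utf8_bytes)       # no match: one byte is left over
byte_full_two = re.fullmatch(rb"caf..", utf8_bytes)  # matches: two bytes for e-acute

print(text_full is not None, byte_full is not None, byte_full_two is not None)
```

With pure ASCII input the two modes agree; once a character occupies more than one byte, a byte-oriented match counts bytes while a text-oriented match counts characters.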
Since Java also uses UTF-16 as its internal String encoding, -jregex- will probably suffer from the same issue if it uses the Java String class (this is purely a guess, as I have not had time to play with jregex).