Strange behavior by -cond()- and -ustrregexs()-

Ali Atia

Join Date: May 2020

Posts: 737
#1

11 Apr 2022, 14:54

I have encountered a strange glitch(?) when combining cond() and ustrregexs(). Using the following code (where the regex operator \d matches a single numeric digit):

Code:

clear
input str3 var1
"1a"
"2b"
"3c"
"abc"
end

gen var2 = cond(ustrregexm(var1,"\d"),ustrregexs(0),var1)

The output I expect is as follows: for each observation where ustrregexm() finds a match, var2 contains the matching digit; otherwise, var2 contains a copy of var1.

Code:

     +-------------+
     | var1   var2 |
     |-------------|
  1. |   1a      1 |
  2. |   2b      2 |
  3. |   3c      3 |
  4. |  abc    abc |
     +-------------+

The actual output looks different, however:

Code:

     +-------------+
     | var1   var2 |
     |-------------|
  1. |   1a        |
  2. |   2b      1 |
  3. |   3c      2 |
  4. |  abc    abc |
     +-------------+

While the output for non-matching observations is as expected, it seems that when the ustrregexm() results in a match, that match is used to evaluate the next observation's ustrregexs().

This is odd because ostensibly the current observation's ustrregexm() must be evaluated before the current observation's ustrregexs() in order to determine whether the condition is true or false, which means that the ustrregexs() should subsequently evaluate in the current observation.

I can't put my finger on why exactly cond() and ustrregexs() behave this way. Any ideas would be appreciated.

Note: I am aware I could use ustrregexra to achieve the same effect, but I am specifically hoping to understand why cond behaves this way.

Last edited by Ali Atia; 11 Apr 2022, 15:14.
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2389
#2

11 Apr 2022, 16:27

First, a solution, by splitting the steps into 2 operations.

Code:

gen var3 = ustrregexs(0) if ustrregexm(var1,"\d")
replace var3 = var1 if mi(var3)
list

Result:

Code:

     +--------------------+
     | var1   var2   var3 |
     |--------------------|
  1. |   1a             1 |
  2. |   2b      1      2 |
  3. |   3c      2      3 |
  4. |  abc    abc    abc |
     +--------------------+

This is speculation -- not a confirmed answer. The behaviour seems to have to do with the Unicode regex functions more than cond(), and this sort of thing does come up from time to time. It seems that in the background, -ustrregexm()- sets up a class which handles all the parsing and matching behaviour, and then functions like -ustrregexs()- interface with this object to pull out the requested matching piece. It seems that -cond()- needs to evaluate all of its arguments independently before resolving the conditional substitution. In this way, for the first observation, -ustrregexs(0)- first evaluates to nothing because no prior regex object has been set up with a match. The first argument does create such an object and match, but its results are not available until the subsequent call to -ustrregexs()-. Then you get a "lagging" result as seen in the example.

On the other hand, when -ustrregexm()- is called when creating -var3-, the if clause is resolved first, setting up the regex object, and allowing those results to be accessed in the assignment.
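
A small check along these lines, as a sketch only: it rebuilds the thread's example, generates the cond() version and the if-qualifier version side by side, and counts where they disagree. The variable names follow the posts above. Nothing here confirms how Stata evaluates cond() internally, and var2 is generated first only to mirror the order used in the thread.

```stata
clear
input str3 var1
"1a"
"2b"
"3c"
"abc"
end

* One-line cond() version from #1
gen var2 = cond(ustrregexm(var1, "\d"), ustrregexs(0), var1)

* Two-step version: match tested in the if qualifier, then non-matches filled in
gen var3 = ustrregexs(0) if ustrregexm(var1, "\d")
replace var3 = var1 if missing(var3)

* Report the observations where the two versions disagree
count if var2 != var3
list var1 var2 var3 if var2 != var3
```

If the lag described above is present, the listed rows should be the observations that contain a digit.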
Andrew Musau

Join Date: Oct 2014

Posts: 10058
#3

11 Apr 2022, 16:59

Originally posted by Ali Atia

I have encountered a strange glitch(?) when combining cond() and ustrregexs(). Using the following code (where the regex operator \d matches a single numeric digit):

Code:

clear
input str3 var1
"1a"
"2b"
"3c"
"abc"
end

gen var2 = cond(ustrregexm(var1,"\d"),ustrregexs(0),var1)

It appears that when ustrregexs(0) is evaluated, the subexpression it returns comes from the previous evaluation of ustrregexm(var1,"\d"), as your results illustrate. Instead, you can compute the result with an expression that does not rely on the stored match, e.g.,

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str3 var1
"1a"
"2b"
"3c"
"abc"
end

gen var2 = cond(ustrregexm(var1,"\d"),ustrregexra(var1,"[^\d]", ""),var1)

Res.:

Code:

. l

     +-------------+
     | var1   var2 |
     |-------------|
  1. |   1a      1 |
  2. |   2b      2 |
  3. |   3c      3 |
  4. |  abc    abc |
     +-------------+

Last edited by Andrew Musau; 11 Apr 2022, 17:27.
Ali Atia

Join Date: May 2020

Posts: 737
#4

11 Apr 2022, 17:14

Originally posted by Leonardo Guizzetti

This is speculation -- not a confirmed answer. The behaviour seems to have to do with the Unicode regex functions more than cond(), and this sort of thing does come up from time to time. It seems that in the background, -ustrregexm()- sets up a class which handles all the parsing and matching behaviour, and then functions like -ustrregexs()- interface with this object to pull out the requested matching piece. It seems that -cond()- needs to evaluate all of its arguments independently before resolving the conditional substitution. In this way, for the first observation, -ustrregexs(0)- first evaluates to nothing because no prior regex object has been set up with a match. The first argument does create such an object and match, but its results are not available until the subsequent call to -ustrregexs()-. Then you get a "lagging" result as seen in the example.

On the other hand, when -ustrregexm()- is called when creating -var3-, the if clause is resolved first, setting up the regex object, and allowing those results to be accessed in the assignment.

This seems like the most plausible explanation, though a little unintuitive. If it's correct, it would be useful to have it clearly stated in the documentation for cond() (i.e., both expressions are evaluated before the condition, and then the condition is evaluated to select one of them).
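
One way to probe this reading, offered as a sketch rather than a confirmed test: if the second argument really is evaluated before the condition, the value picked up by ustrregexs(0) should depend on whichever observation was processed just before, so reordering the data should change the result. The gsort step and the variable name var2_rev are illustrative choices, not part of the original posts, and no particular output is claimed here.

```stata
clear
input str3 var1
"1a"
"2b"
"3c"
"abc"
end

* cond() version in the original order
gen var2 = cond(ustrregexm(var1, "\d"), ustrregexs(0), var1)

* Reverse the order so each observation has a different predecessor
gsort -var1
gen var2_rev = cond(ustrregexm(var1, "\d"), ustrregexs(0), var1)

list var1 var2 var2_rev
```

If var2_rev follows the preceding observation in the new order, that would be consistent with the lag explanation; if both columns hold each row's own digit, it would not.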

Andrew Musau

Join Date: Oct 2014
Posts: 10058
#5

11 Apr 2022, 17:23

A counter-intuitive way to get around this is to repeat the condition so that the previous evaluation is the desired one.

Code:

clear
input str3 var1
"1a"
"2b"
"3c"
"abc"
end

gen var2= cond(ustrregexm(var1,"\d"),ustrregexs(0), cond(ustrregexm(var1,"\d"),ustrregexs(0),var1))

Res.:

Code:

. l

     +-------------+
     | var1   var2 |
     |-------------|
  1. |   1a      1 |
  2. |   2b      2 |
  3. |   3c      3 |
  4. |  abc    abc |
     +-------------+


Ali Atia

Join Date: May 2020

Posts: 737
#6

11 Apr 2022, 17:26

Great workaround.
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2389
#7

11 Apr 2022, 17:34

#5 works, and I admire the creativity, but it is harder to read and understand than the other workarounds offered.
Ali Atia

Join Date: May 2020

Posts: 737
#8

11 Apr 2022, 17:35

I am familiar with the other workarounds, but my aim in using cond() was to do it in one line without using ustrregexra(), which #5 does.

Last edited by Ali Atia; 11 Apr 2022, 18:07.