  • Strange behavior by -cond()- and -ustrregexs()-

    I have encountered a strange glitch(?) when combining cond() and ustrregexs(). Using the following code (where the regex operator \d matches a single numeric digit):

    Code:
    clear
    input str3 var1
    "1a"
    "2b"
    "3c"
    "abc"
    end
    gen var2 = cond(ustrregexm(var1,"\d"),ustrregexs(0),var1)
    I expect the following output, where var2 contains the matching digit for each observation that ustrregexm() matches and a copy of var1 otherwise:

    Code:
         +-------------+
         | var1   var2 |
         |-------------|
      1. |   1a      1 |
      2. |   2b      2 |
      3. |   3c      3 |
      4. |  abc    abc |
         +-------------+
    The actual output looks different, however:

    Code:
         +-------------+
         | var1   var2 |
         |-------------|
      1. |   1a        |
      2. |   2b      1 |
      3. |   3c      2 |
      4. |  abc    abc |
         +-------------+
    While the output for non-matching observations is as expected, it seems that when the ustrregexm() results in a match, that match is used to evaluate the next observation's ustrregexs().

    This is odd because ostensibly the current observation's ustrregexm() must be evaluated before the current observation's ustrregexs() in order to determine whether the condition is true or false, which means that the ustrregexs() should subsequently evaluate in the current observation.

    I can't put my finger on why exactly cond() and ustrregexs() behave this way. Any ideas would be appreciated.

    Note: I am aware I could use ustrregexra to achieve the same effect, but I am specifically hoping to understand why cond behaves this way.
    Last edited by Ali Atia; 11 Apr 2022, 16:14.

  • #2
    First, a solution, by splitting the steps into 2 operations.

    Code:
    gen var3 = ustrregexs(0) if ustrregexm(var1,"\d")
    replace var3 = var1 if mi(var3)
    list
    Result:

    Code:
         +--------------------+
         | var1   var2   var3 |
         |--------------------|
      1. |   1a             1 |
      2. |   2b      1      2 |
      3. |   3c      2      3 |
      4. |  abc    abc    abc |
         +--------------------+
    This is speculation, not a confirmed answer. The behaviour seems to have to do with the Unicode regex functions more than with cond(), and this sort of thing does come up from time to time. It seems that, in the background, -ustrregexm()- sets up an object that handles all the parsing and matching, and functions like -ustrregexs()- then query this object to pull out the requested piece of the match. -cond()- appears to evaluate all of its arguments independently before resolving the conditional substitution. So, for the first observation, -ustrregexs(0)- evaluates to an empty string because no regex object has yet been set up with a match. The first argument does create such an object and match, but its results are not available until the next call to -ustrregexs()-, which happens in the following observation. The result is the "lagging" output seen in the example.

    By contrast, when -var3- is created, the if qualifier containing -ustrregexm()- is resolved first, which sets up the regex object and makes its results available to the assignment.
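The hypothesis above can be mimicked outside Stata. The following is a minimal Python sketch, not Stata code and not a description of Stata's internals. The class RegexState and the functions cond_args_first and two_step are hypothetical names, and the order in which cond() reads its arguments is an assumption chosen to match the output in #1.

```python
import re


class RegexState:
    """Toy stand-in for the shared match object hypothesised in the post."""

    def __init__(self):
        self.last = None  # most recent match object, or None

    def m(self, text, pattern):
        # Plays the role of ustrregexm(): run the match, store it, return 1/0.
        self.last = re.search(pattern, text)
        return 1 if self.last else 0

    def s(self, n):
        # Plays the role of ustrregexs(n): read from whichever match is stored.
        return self.last.group(n) if self.last else ""


def cond_args_first(rx, value):
    # ASSUMPTION: the ustrregexs(0) argument is read before the
    # ustrregexm() condition updates the state for this observation.
    if_true = rx.s(0)
    test = rx.m(value, r"\d")
    return if_true if test else value


def two_step(rx, value):
    # Mirrors "gen ... if ustrregexm()" followed by "replace ... if mi()":
    # the match runs first, and only then is the stored result read.
    out = rx.s(0) if rx.m(value, r"\d") else ""
    return value if out == "" else out


data = ["1a", "2b", "3c", "abc"]

rx = RegexState()
lagged = [cond_args_first(rx, v) for v in data]

rx = RegexState()
fixed = [two_step(rx, v) for v in data]

print(lagged)
print(fixed)
```

Under that assumption the sketch gives the lagged column from #1 for the cond() form and the expected column for the two-step form. This shows only that the explanation is internally consistent, not that Stata works this way.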


    • #3
      Originally posted by Ali Atia
      I have encountered a strange glitch(?) when combining cond() and ustrregexs(). Using the following code (where the regex operator \d matches a single numeric digit):

      Code:
      clear
      input str3 var1
      "1a"
      "2b"
      "3c"
      "abc"
      end
      gen var2 = cond(ustrregexm(var1,"\d"),ustrregexs(0),var1)
      It appears that when ustrregexs(0) is evaluated, the subexpression it returns comes from the previous evaluation of ustrregexm(var1,"\d"), as your results illustrate. Instead, you want the extraction to be self-contained rather than dependent on that earlier evaluation, e.g.,

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input str3 var1
      "1a"
      "2b"
      "3c"
      "abc"
      end
      
      gen var2 = cond(ustrregexm(var1,"\d"),ustrregexra(var1,"[^\d]", "") ,var1)
      Res.:

      Code:
      . l
      
           +-------------+
           | var1   var2 |
           |-------------|
        1. |   1a      1 |
        2. |   2b      2 |
        3. |   3c      3 |
        4. |  abc    abc |
           +-------------+
      Last edited by Andrew Musau; 11 Apr 2022, 18:27.


      • #4
        Originally posted by Leonardo Guizzetti
        This is speculation, not a confirmed answer. The behaviour seems to have to do with the Unicode regex functions more than with cond(), and this sort of thing does come up from time to time. It seems that, in the background, -ustrregexm()- sets up an object that handles all the parsing and matching, and functions like -ustrregexs()- then query this object to pull out the requested piece of the match. -cond()- appears to evaluate all of its arguments independently before resolving the conditional substitution. So, for the first observation, -ustrregexs(0)- evaluates to an empty string because no regex object has yet been set up with a match. The first argument does create such an object and match, but its results are not available until the next call to -ustrregexs()-, which happens in the following observation. The result is the "lagging" output seen in the example.

        By contrast, when -var3- is created, the if qualifier containing -ustrregexm()- is resolved first, which sets up the regex object and makes its results available to the assignment.
        This seems like the most plausible explanation, though a little unintuitive. If it's correct, it would be useful to have it clearly stated in the documentation for cond() (i.e., both expressions are evaluated before the condition, and then the condition is evaluated to select one of them).


        • #5
          A counter-intuitive way to get around this is to repeat the condition so that the previous evaluation is the desired one.

          Code:
          clear
          input str3 var1
          "1a"
          "2b"
          "3c"
          "abc"
          end
          
          gen var2 = cond(ustrregexm(var1,"\d"),ustrregexs(0), cond(ustrregexm(var1,"\d"),ustrregexs(0),var1))
          Res.:

          Code:
          . l
          
               +-------------+
               | var1   var2 |
               |-------------|
            1. |   1a      1 |
            2. |   2b      2 |
            3. |   3c      3 |
            4. |  abc    abc |
               +-------------+
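One way to see why repeating the condition could help is a small Python model, offered as a sketch rather than as a statement about Stata's internals. It assumes that cond() evaluates all three arguments, last one first. That order is a guess that happens to reproduce both the lagged output in #1 and the corrected output here. The names regexm, regexs, cond_rtl and nested are hypothetical.

```python
import re

# Shared match state, standing in for the regex object described in #2.
state = {"match": None}


def regexm(text, pattern):
    # Plays the role of ustrregexm(): store the match, return 1/0.
    state["match"] = re.search(pattern, text)
    return 1 if state["match"] else 0


def regexs(n):
    # Plays the role of ustrregexs(n): read from the stored match.
    found = state["match"]
    return found.group(n) if found else ""


def cond_rtl(test_fn, true_fn, false_fn):
    # ASSUMPTION: every argument is evaluated, in right-to-left order,
    # before the condition selects one of the two results.
    if_false = false_fn()
    if_true = true_fn()
    test = test_fn()
    return if_true if test else if_false


def nested(value):
    def is_digit():
        return regexm(value, r"\d")

    def inner():
        # Inner cond(): its regexs(0) is stale, but its regexm() call
        # refreshes the state for the current observation.
        return cond_rtl(is_digit, lambda: regexs(0), lambda: value)

    # Outer cond(): its regexs(0) now reads the state set by the inner match.
    return cond_rtl(is_digit, lambda: regexs(0), inner)


result = [nested(v) for v in ["1a", "2b", "3c", "abc"]]
print(result)
```

In this model the inner cond() returns a stale value that the outer cond() discards whenever the observation matches. Its only useful effect is the inner match call, which refreshes the stored match before the outer ustrregexs(0) is read.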


          • #6
            Great workaround.


            • #7
              #5 works, and I admire the creativity, but it is harder to read and understand than the other workarounds offered.


              • #8
                I am familiar with the other workarounds, but my aim in using cond() was to do it in one line without ustrregexra, which #5 does.
                Last edited by Ali Atia; 11 Apr 2022, 19:07.
