  • regexm as a vector function - regex rocks!!

    Hi
    By accident I found out that regexm is a vector function in both arguments.
    This makes regexm even more powerful!
    Take a look:
    Code:
    : txt = "We discuss Stata, statistics, and Stata and statistics. You can browse without registering but you need to register to participate in the discussion or ask a question."
    
    : tok = tokens(txt)    // Text to string vector
    : select(tok, regexm(tok, "s$"))    // Get words ending in "s"
      discuss
    : select(tok, regexm(tok, "s$|s\.$"))    // Get words ending in "s" or "s."
                     1             2
        +-----------------------------+
      1 |      discuss   statistics.  |
        +-----------------------------+
    : select(tok, regexm(tok, "q"))    // Get words containing "q"
      question.
    : select(tok, regexm(tok, "[se]$"))    // Words ending in either "s" or "e"
                     1             2             3             4             5
        +-----------------------------------------------------------------------+
      1 |           We       discuss        browse   participate           the  |
        +-----------------------------------------------------------------------+
    
    // Now the other way around:
    
    : regexm(txt, ("ss", "se", "so"))    // Which of the strings "ss", "se" and "so" appear in txt?
           1   2   3
        +-------------+
      1 |  1   1   0  |
        +-------------+
    : any(regexm(txt, ("ss", "se", "so")))    // Do any of the strings "ss", "se" and "so" appear in txt?
      1
    : all(regexm(txt, ("ss", "se", "so")))    // Do all of them?
      0
    If you (as I do) have to handle string data from time to time, then you will probably agree that this is quite nice.
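    A minimal sketch along the same lines, using the same txt as in the code above: it flags every token that matches at least one of several patterns by looping over the patterns and combining the regexm() results with Mata's elementwise "or" operator (:|). The patterns "^St" and "ion$" are illustrative choices, not taken from the post, and the fragment is written do-file style rather than at the ": " prompt.

    ```mata
    mata:
    // Same text as in the post
    txt  = "We discuss Stata, statistics, and Stata and statistics. You can browse without registering but you need to register to participate in the discussion or ask a question."
    tok  = tokens(txt)
    pats = ("^St", "ion$")          // illustrative patterns (assumption)

    // hit[j] ends up 1 if token j matches at least one pattern
    hit = J(1, cols(tok), 0)
    for (i = 1; i <= cols(pats); i++) {
        hit = hit :| regexm(tok, pats[i])
    }

    select(tok, hit)                // tokens matching at least one pattern
    sum(hit)                        // number of such tokens
    end
    ```

    Each regexm() call here combines a string vector with a single scalar pattern, which is the usage the examples above demonstrate.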
    Last edited by Niels Henrik Bruun; 04 Aug 2015, 06:00.
    Kind regards

    nhb

  • #2
    I agree



    • #3
      Niels --

      Good observation! One thing that is odd about that, however, is that I couldn't figure out how to get this to work with regexs(). For example, if I use your text, and do:

      Code:
      regexm(txt,("ss","se","so"))
      regexs()
      I get an error. However, if I do
      Code:
      regexm(txt,"ss")
      regexs()
      I get the desired result. Do you think this implies that regexm() can only work with scalars?

      Best,

      Matt





      • #4
        Hi Matt
        So far as I can see -regexs- behaves rather poorly.
        When I use -regexs- it is to find and separate, e.g., dates and times in text blocks.
        Let's say that I'm looking for strings of numbers with a dash somewhere in between and I want to split such a string into 2 number strings:
        Code:
        : x = "123-5345"
        : regexm(x, "^([0-9]+)\-([0-9]+)$")
          1
        // regexs() without arguments gives the whole match as well as the 2 number blocks
        : regexs()
                      1          2          3
            +----------------------------------+
          1 |  123-5345        123       5345  |
            +----------------------------------+
        // And they can be retrieved with arguments (numbers 0 to 9)
        : for (r=0;r<4;r++) regexs(r)
          123-5345
          123
          5345
        invalid number, outside of allowed range  
        // Now I insist that the first number string starts with 2, and I get nothing but errors
        : regexm(x, "^(2[0-9]+)\-([0-9]+)$")
          0
        : regexs()
        invalid number, outside of allowed range
        So what if I want to use 2 filters at once:
        Code:
        : regexm(x, ("^([0-9]+)\-([0-9]+)$", "^(2[0-9]+)\-([0-9]+)$"))
               1   2
            +---------+
          1 |  1   0  |
            +---------+
        : regexs()
        invalid number, outside of allowed range
        : regexm(x, ("^(2[0-9]+)\-([0-9]+)$", "^([0-9]+)\-([0-9]+)$"))
               1   2
            +---------+
          1 |  0   1  |
            +---------+
        : regexs()
                      1          2          3
            +----------------------------------+
          1 |  123-5345        123       5345  |
            +----------------------------------+
        So it appears that only the results from the last filter are kept.
        This, I think, can in part be handled by writing some more advanced regular expressions.
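        Another possible workaround, sketched here under the assumption that regexs() always refers to the most recent regexm() call (as the output above suggests), is to apply the filters one at a time and read regexs() right after each successful match. The loop and its messages are illustrative only.

        ```mata
        mata:
        x    = "123-5345"
        pats = ("^([0-9]+)\-([0-9]+)$", "^(2[0-9]+)\-([0-9]+)$")

        // One scalar regexm() call per filter, so regexs() is never ambiguous
        for (i = 1; i <= cols(pats); i++) {
            if (regexm(x, pats[i])) {
                printf("filter %g matched\n", i)
                regexs()            // whole match plus bracketed parts
            }
            else printf("filter %g did not match\n", i)
        }
        end
        ```

        Guarding regexs() with the result of regexm() also avoids the "invalid number, outside of allowed range" error seen after a failed match.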

        I find it more problematic the other way around:
        Code:
        // x is now a two-element string vector whose last element is "432-5343" (assignment not shown)
        : regexm(x, "^([0-9]+)\-([0-9]+)$")
               1   2
            +---------+
          1 |  1   1  |
            +---------+
        : regexs()
                      1          2          3
            +----------------------------------+
          1 |  432-5343        432       5343  |
            +----------------------------------+
        Here too -regexs- only returns the result for the last element.
        And here I think it would have been very nice if you could call -regexs- directly with 3 arguments: the string to filter, the filter string, and the number of the bracket I want to see.
        Better still would be to get a 9 or 10 column matrix with the combined output, since only subexpressions 0 to 9 can be retrieved.

        But -regexs- is better than nothing, and a lot can be helped by coding.
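        As a rough sketch of what such coding might look like: the two helpers below are hypothetical (the names rx_group and rx_groups are made up here, they are not built-in Mata functions). The first takes the string, the filter and a bracket number; the second loops over a string vector and collects the regexs() output in a 10-column matrix, one row per element. Both assume that regexs() reflects the most recent scalar regexm() call.

        ```mata
        mata:
        // Hypothetical helper: bracket n of pat in s, or "" when s does not match.
        // An n beyond the number of brackets in pat still raises the regexs() error.
        string scalar rx_group(string scalar s, string scalar pat, real scalar n)
        {
            if (regexm(s, pat)) return(regexs(n))
            return("")
        }

        // Hypothetical helper: one row per element of x; columns 1 to 10 hold
        // subexpressions 0 to 9, left as "" where there is no match or no bracket.
        string matrix rx_groups(string vector x, string scalar pat)
        {
            real scalar      i, j, k
            string rowvector g
            string matrix    res

            res = J(length(x), 10, "")
            for (i = 1; i <= length(x); i++) {
                if (regexm(x[i], pat)) {
                    g = regexs()
                    k = min((cols(g), 10))
                    for (j = 1; j <= k; j++) res[i, j] = g[j]
                }
            }
            return(res)
        }
        end
        ```

        With illustrative input such as x = ("123-5345", "432-5343"), a call like rx_groups(x, "^([0-9]+)\-([0-9]+)$") would then give one row of matches per element instead of only the last one.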
        Kind regards

        nhb
