regexm as a vector function - regex rocks!!

Niels Henrik Bruun

Join Date: Aug 2014
Posts: 552

04 Aug 2015, 04:56

Hi
By accident I found out that regexm() in Mata is vectorized in both arguments: either the string or the pattern can be a vector.
This makes regexm() even more powerful!
Take a look:

Code:

: txt = "We discuss Stata, statistics, and Stata and statistics. You can browse without registering but you need to register to participate in the discussion or ask a question."

: tok = tokens(txt)    // Text to string vector
: select(tok, regexm(tok, "s$"))    // Get words ending in "s"
  discuss
: select(tok, regexm(tok, "s$|s\.$"))    // Get words ending in "s" or "s."
                 1             2
    +-----------------------------+
  1 |      discuss   statistics.  |
    +-----------------------------+
: select(tok, regexm(tok, "q"))    // Get words containing "q"
  question.
: select(tok, regexm(tok, "[se]$"))    // Words ending in either "s" or "e"
                 1             2             3             4             5
    +-----------------------------------------------------------------------+
  1 |           We       discuss        browse   participate           the  |
    +-----------------------------------------------------------------------+

// Now the other way around:

: regexm(txt, ("ss", "se", "so"))    // Which of the strings "ss", "se" and "so" appear in txt?
       1   2   3
    +-------------+
  1 |  1   1   0  |
    +-------------+
: any(regexm(txt, ("ss", "se", "so")))    // Do any of the strings "ss", "se" and "so" appear in txt?
  1
: all(regexm(txt, ("ss", "se", "so")))    // Do all of them?
  0

If you (like me) have to handle string data from time to time, you will probably agree that this is quite nice.

Last edited by Niels Henrik Bruun; 04 Aug 2015, 05:00.

Kind regards

nhb


Christophe Kolodziejczyk

Join Date: Mar 2014

Posts: 377
#2

04 Aug 2015, 05:04

I agree
Matthew J. Baker

Join Date: Mar 2014

Posts: 126
#3

04 Aug 2015, 06:32

Niels --

Good observation! One thing that is odd about that, however, is that I couldn't figure out how to get this to work with regexs(). For example, if I use your text, and do:

Code:

regexm(txt,("ss","se","so"))
regexs()

I get an error. However, if I do

Code:

regexm(txt,"ss")
regexs()

I get the desired result. Do you think this implies that regexm() can only work with scalars?

Best,

Matt

Niels Henrik Bruun

Join Date: Aug 2014
Posts: 552

05 Aug 2015, 07:50

Hi Matt
As far as I can see, -regexs- behaves rather poorly.
When I use -regexs-, it is to find and separate e.g. dates and times in text blocks.
Let's say that I'm looking for strings of numbers with a dash somewhere in between, and I want to split such a string into 2 number strings:

Code:

: x = "123-5345"
: regexm(x, "^([0-9]+)\-([0-9]+)$")
  1
// regexs() without arguments gives the input as well as the 2 number blocks
: regexs()
              1          2          3
    +----------------------------------+
  1 |  123-5345        123       5345  |
    +----------------------------------+
// And they can be retrieved with arguments (numbers 0 to 9)
: for (r=0;r<4;r++) regexs(r)
  123-5345
  123
  5345
invalid number, outside of allowed range  
// Now I insist that the first number string starts with 2, and I get nothing but errors
: regexm(x, "^(2[0-9]+)\-([0-9]+)$")
  0
: regexs()
invalid number, outside of allowed range

So what if I want to use 2 filters at once:

Code:

: regexm(x, ("^([0-9]+)\-([0-9]+)$", "^(2[0-9]+)\-([0-9]+)$"))
       1   2
    +---------+
  1 |  1   0  |
    +---------+
: regexs()
invalid number, outside of allowed range
: regexm(x, ("^(2[0-9]+)\-([0-9]+)$", "^([0-9]+)\-([0-9]+)$"))
       1   2
    +---------+
  1 |  0   1  |
    +---------+
: regexs()
              1          2          3
    +----------------------------------+
  1 |  123-5345        123       5345  |
    +----------------------------------+

So it appears that only the results from the last filter are kept.
This I think can in part be handled by writing some more advanced regular expressions.

I find it more problematic the other way around, when x is a vector of strings:

Code:

: regexm(x, "^([0-9]+)\-([0-9]+)$")
       1   2
    +---------+
  1 |  1   1  |
    +---------+
: regexs()
              1          2          3
    +----------------------------------+
  1 |  432-5343        432       5343  |
    +----------------------------------+

Here, too, -regexs- only returns the results for the last element.
And here I think it would have been very nice if you could call -regexs- directly with 3 arguments: the string to filter, the filter string, and a number for which bracket I want to see.
Better still would be to get a 9- or 10-column matrix with the combined output, since at most 9 subexpressions can be retrieved anyway.

But -regexs- is better than nothing, and a lot can be worked around with some coding.
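
As a rough idea of what such coding could look like, here is a minimal sketch that loops over the elements one at a time, so that each -regexs- call refers to a scalar match. The function name regex_groups and its ngroups argument are hypothetical and not part of Mata. The sketch assumes that ngroups equals the number of brackets in the filter string.

```mata
mata:
// Hypothetical helper (not a built-in): one row per element of s,
// whole match in column 1, bracket g in column g+1.
// Rows stay empty where the filter does not match.
// Assumes ngroups is the number of ( ) groups in re (at most 9),
// because regexs(g) fails when g is beyond the last group.
string matrix regex_groups(string vector s, string scalar re, real scalar ngroups)
{
    real scalar   i, g
    string matrix out

    out = J(length(s), ngroups + 1, "")
    for (i = 1; i <= length(s); i++) {
        // scalar call, so regexs() refers to exactly this element
        if (regexm(s[i], re)) {
            for (g = 0; g <= ngroups; g++) {
                out[i, g + 1] = regexs(g)
            }
        }
    }
    return(out)
}

// Usage with the filter and the two strings shown earlier in this post
regex_groups(("123-5345", "432-5343"), "^([0-9]+)\-([0-9]+)$", 2)
end
```

Calling regexm() on one element at a time gives up the vectorized form, but the results for every element are kept, not only those for the last one.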

Kind regards

nhb
