  • token function is strange

    Hi
    I was doing some coding when I found the following Mata curiosity:
    Code:
    : t1 = "This is a test"
    : t2 = subinstr(t1, " ", char(9), .)
    : t2
      This09is09a09test
    : tokens(t1)    // Here space isn't in the output vector, which is what I want
              1      2      3      4
        +-----------------------------+
      1 |  This     is      a   test  |
        +-----------------------------+
    : tokens(t2, char(9))    // Here for some reason char(9)/tab is in the output vector
              1      2      3      4      5      6      7
        +--------------------------------------------------+
      1 |  This     09     is     09      a     09   test  |
        +--------------------------------------------------+
    : select(dummy=tokens(t2, char(9)), dummy :!= char(9))    // So I have to do something like this to get the proper output
              1      2      3      4
        +-----------------------------+
      1 |  This     is      a   test  |
        +-----------------------------+
    Is this a bug?
    Or is there some good reason for this?
    And is there a smarter way of getting the "proper" output than my select?
    Kind regards

    nhb

  • #2
    I think I have the solution (maybe you found it too in the meantime):

    Code:
    t = tokeninit(char(9))
    tokenset(t, t2)
    tokengetall(t)
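
    With the t2 from #1, this should then give (a minimal sketch of the expected result; tab is the wchar here and so is discarded):

    Code:
    tokengetall(t)    // expected: ("This", "is", "a", "test") -- the tabs are gone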

    • #3
      Hi Christophe
      Thank you very much.
      I'll try that in my code.

      But I still think there is something wrong with the tokens function!
      Kind regards

      nhb

      • #4
        I don't think there is a bug in tokens(). When you use tokeninit(), you have to provide the wchars, i.e. the delimiters that identify white space, and then the pchars, i.e. the parsing characters, which are themselves returned as tokens. In tokens() you provide only the pchars (or parsechars).
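
        For instance (a minimal sketch; the ";" delimiter is just an illustration):

        Code:
        tokens("a;b", ";")    // ";" is a pchar: returns ("a", ";", "b")

        t = tokeninit(";")    // ";" is a wchar: it is discarded
        tokenset(t, "a;b")
        tokengetall(t)        // returns ("a", "b")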

        • #5
          Hi Christophe
          You're right as far as tokeninit() etc. are concerned.
          I still think that -tokens-, being a simple function, should treat pchars the same as wchars, like a simple split function would.
          But maybe it is just me thinking so.
          Kind regards

          nhb

          • #6
            I have to agree with Niels that tokens() behaves in a counter-intuitive manner, especially if you have prior experience with split(), with import delimited, or with parsing routines in other programs (e.g. Perl, Excel). In all those cases, the character specified as the delimiter is discarded from the parsed results.

            This behavior isn't described in the [M-5] documentation for tokens(). If it were intuitive, it wouldn't need documenting; since it's uncommon, it should be documented.
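
            For comparison, a minimal Stata sketch (a hypothetical one-observation example) of the split behavior described above:

            Code:
            clear
            set obs 1
            generate str20 s = "this;is;a;test"
            split s, parse(;)
            // s1="this", s2="is", s3="a", s4="test" -- the ";" itself is discarded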

            • #7
              Hi
              I can't even get the solution from Christophe to work properly when, e.g., parsing csv files (I know there are Stata functions for that).
              So in the end I had to build my own.
              For other lost souls, I leave my solution here:
              Code:
                  string rowvector tokensplit(string scalar txt, string scalar delimiter)
                  {
                      string rowvector row
                      string scalar    filter

                      row = J(1, 0, "")
                      // note: delimiter is dropped into a regular expression, so
                      // regex metacharacters in it would need escaping
                      filter = sprintf("(.*)%s(.*)", delimiter)
                      // the greedy (.*) matches up to the last delimiter, so each
                      // pass peels off the final field; empty fields are preserved
                      while (regexm(txt, filter)) {
                          txt = regexs(1)
                          row = regexs(2), row
                      }
                      row = txt, row
                      return(row)
                  }
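
              A quick check with a hypothetical input shows the empty fields surviving:

              Code:
                  tokensplit("a;;b;", ";")    // expected: ("a", "", "b", "")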
              Have fun
              Kind regards

              nhb

              • #8
                Christophe's solution should read

                Code:
                t = tokeninit((" "+char(9)))
                treating tab and (white) space as white-space characters.

                Terminology is consistent within Mata, i.e., between tokens() and tokenget(), but it does differ between Mata and Stata functions. In Mata, wchars (white-space characters) are different from pchars (parsing characters), in the way documented in tokenget(). In Stata, what is called pchars, e.g. in gettoken, corresponds to Mata's concept of wchars. This is indeed unfortunate.
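
                A minimal sketch of the wchars/pchars distinction, using tokeninit()'s first two arguments (the ";" is just an illustration):

                Code:
                t = tokeninit(" ", ";")    // wchars = blank (discarded), pchars = ";" (returned)
                tokenset(t, "a; b c")
                tokengetall(t)             // returns ("a", ";", "b", "c")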

                If Niels is interested in more complex parsing, then reading through tokenget() will be worth the effort. A rewrite of his function could read something like

                Code:
                string rowvector mytokensplit(string scalar txt, string scalar delimiter)
                {
                    transmorphic scalar t

                    t = tokeninit((" " + delimiter))    // blank and delimiter are both wchars
                    tokenset(t, txt)
                    return(tokengetall(t))
                }
                Best
                Daniel
                Last edited by daniel klein; 28 Oct 2015, 12:46.

                • #9
                  Hi Daniel
                  Thank you very much for your answer.

                  I'm sorry for not specifying the problem I meant to solve with my function.
                  The problem has to do with empty fields in a row.

                  The string scalar txt is a typical csv line in which every even field is empty/blank.
                  Comparing your function and mine gives:
                  Code:
                  :         txt = "this is;; a test;"
                  
                  :         mytokensplit(txt, ";")
                            1      2      3      4
                      +-----------------------------+
                    1 |  this     is      a   test  |
                      +-----------------------------+
                  
                  :         tokensplit(txt, ";")
                               1         2         3         4
                      +-----------------------------------------+
                    1 |  this is              a test            |
                      +-----------------------------------------+
                  As you can see, it is my function (tokensplit) that does the job.

                  It might be that tokeninit/tokenset can handle this, but it is not intuitive how.
                  So I have given up on getting something useful out of tokeninit/tokenset,
                  especially when I can write something better within half an hour.
                  Kind regards

                  nhb
