Tokens: different behaviour depending on the parsechar specified: bug? feature?

Salah Mahmud

Join Date: Jun 2016

Posts: 30
#1

30 Apr 2017, 21:06

Hello,
Does anyone know why tokens("0.1.0", ".") or tokens("0.1.0", char(46)) produces a 5-cell rowvector, whereas tokens("0 1 0", " ") produces only 3 cells?

This is what I get in Stata 14.2:

. mata: tokens("0.1.0", ".")
       1   2   3   4   5
    +---------------------+
  1 |  0   .   1   .   0  |
    +---------------------+

. mata: tokens("0 1 0", " ")
       1   2   3
    +-------------+
  1 |  0   1   0  |
    +-------------+
Joseph Coveney

Join Date: Apr 2014

Posts: 4374
#2

30 Apr 2017, 22:23

It looks like unexpected behavior to me, i.e., a bug.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

01 May 2017, 08:31

I agree that this behavior of tokens() is unexpected, and help mf_tokens does not suggest it is intended. However, I'm guessing that tokens() is built on top of tokenget(), and help mf_tokenget has much to say. tokenget() distinguishes between "parsing characters", which separate tokens and are themselves returned as tokens, and "white-space characters", which separate tokens and are discarded. The behavior we anticipated from tokens("0.1.0", ".") would appear to require treating the second argument as a white-space character rather than a parsing character. It should in theory be possible to use tokeninit(), tokenset(), and tokenget() to obtain the desired results, but that's well beyond my competence to implement.

Consider also that the behavior is consistent with the tokenize command.

Code:

. tokenize "0.1.0", parse(".")

. di "1=|`1'|, 2=|`2'|, 3=|`3'|, 4=|`4'|, 5=|`5'|, 6=|`6'|"
1=|0|, 2=|.|, 3=|1|, 4=|.|, 5=|0|, 6=||
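
For what it's worth, here is a minimal sketch of the tokeninit() route mentioned above. It is untested and assumes the argument order given in help mf_tokenget, with white-space characters as the first argument and parsing characters as the second. The names t and toks are just illustrative. The second half is an alternative that keeps tokens() and filters out the parsing character afterwards.

```mata
mata:
// Sketch only: treat "." as a white-space character, so that it
// separates tokens and is discarded rather than returned as a token.
t = tokeninit(".")          // first argument: white-space characters
tokenset(t, "0.1.0")        // the string to be parsed
tokengetall(t)              // all remaining tokens as a string rowvector

// Alternative: keep tokens() and drop the parsing character afterwards.
toks = tokens("0.1.0", ".")
select(toks, toks :!= ".")  // keep only the columns that are not "."
end
```

If the documentation is read correctly, both calls should leave just the three tokens 0, 1 and 0. The two approaches need not agree on every input, for example one with consecutive periods, so check them against your own data.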
Belinda Foster

Join Date: Jul 2016

Posts: 132
#4

01 May 2017, 13:30

Behavior that is not documented and thus unexpected definitely qualifies as a bug. StataCorp should take note!