Behavior of -tokens()-

Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#1

14 Apr 2021, 11:12

Dear All,

I am puzzled by the behavior of the Mata function tokens() in the example below.

Specifically:
why is the opening bracket "(" kept together with the first token, while the closing bracket ")" forms its own token?

why is "aaa" kept in quotes, while bbb is left without quotes?

Code:

mata tokens(`"("aaa" "bbb" "cc dd ee ff")"')

I would expect a symmetry here between the first and the last tokens (regardless of what definition is behind the parsing algorithm).

Thank you, Sergiy Radyakin

PS: Stata 16.1 on Windows.
Tags: parsing, tokens
daniel klein

Join Date: Mar 2014

Posts: 3811
#2

14 Apr 2021, 11:59

We can simplify to

Code:

. mata tokens(`"a"b" "c" "d"e"')
          1      2      3      4
    +-----------------------------+
  1 |  a"b"      c      d      e  |
    +-----------------------------+

What happens, I think, is this: First, Mata [or Stata?] strips the outside (compound) quotes, so we are left with

Code:

a"b" "c" "d"e

to parse.

Edit/Update/Correction:

We are now parsing from left to right. We parse on spaces, except when we encounter an opening quote at the start of a token, in which case we look for the respective closing quote and return the token with both outer quotes removed. A quote in the middle of a token, as in a"b", appears to be treated like any other character. This then results in:

Code:

a"b" c d e
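
To make the rule above concrete, here is a minimal Python sketch that models it. It is only an illustration of the behavior observed in this thread, not StataCorp's implementation: the function name observed_tokens is made up, compound quotes are not handled, and an unmatched opening quote is assumed to be closed at the end of the string.

```python
def observed_tokens(s):
    """Model of the left-to-right rule described above (an assumption, not Stata's code)."""
    tokens = []
    i, n = 0, len(s)
    while i < n:
        if s[i] == " ":
            # Spaces between tokens are skipped.
            i += 1
        elif s[i] == '"':
            # A quote at the START of a token: take everything up to the
            # closing quote and drop both outer quotes.
            close = s.find('"', i + 1)
            if close == -1:
                close = n  # assumption: unmatched quote is closed at the end
            tokens.append(s[i + 1:close])
            i = close + 1
        else:
            # An unquoted token runs to the next space; quotes inside it
            # are treated like any other character.
            end = s.find(" ", i)
            if end == -1:
                end = n
            tokens.append(s[i:end])
            i = end
    return tokens


print(observed_tokens('a"b" "c" "d"e'))  # ['a"b"', 'c', 'd', 'e']
print(observed_tokens('A"B"'))           # ['A"B"']
print(observed_tokens('"A"B'))           # ['A', 'B']
```

Under this model a token ends right after a closing quote, which is why "d"e splits into d and e while a"b" stays together.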

Last edited by daniel klein; 14 Apr 2021, 12:08.
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#3

14 Apr 2021, 16:02

Dear Daniel,

Thank you for your explanation. I feel this is a bit far-fetched, though it corresponds to the observed behavior. I'd expect the algorithm to ignore the delimiters if they occur within quoted content, but not to produce a new token as soon as the quoted content ends. So, writing each input and token as {content}:
{abc"xyz xyz"abc} is 1 token {abc"xyz xyz"abc}
{abc xyz xyz abc} is 4 tokens {abc} {xyz} {xyz} {abc}
{abc"xyz" "xyz"abc} is 2 tokens {abc"xyz"} {"xyz"abc}
{"abc" "xyz" "xyz""abc"} is 3 tokens {abc} {xyz} {"xyz""abc"}
{"abc xyz" "xyz abc"} is 2 tokens {abc xyz} {xyz abc}
etc.

This should produce tokens invariantly, whether we parse from left to right or from right to left (I expect), while current behavior is not symmetric:
{A"B"} is one token, but {"A"B} is two tokens.

PS: and I love symmetry, admittedly

Best, Sergiy Radyakin
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2384
#4

14 Apr 2021, 16:42

I think that if the quotes are important to keep as part of the tokens (or, turned around, if you don't want the parser to treat quotes as special characters), then each token containing them will need to be wrapped in compound quotes.

E.g.,

Code:

mata tokens(`" `"abc"xyz""' `""xyz"abc"' "')

Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2384
#5

14 Apr 2021, 16:59

I found a solution that might be to your liking. I did not realize till going back to the help for -tokens()- that Mata provides an interface to manipulate the parser's behaviour. We can simply tell it not to look for and parse on quoting characters using a null string vector.

Code:

clear *
cls

mata:
  inputs = (`"a"b" "c" "d"e"' \ ///
            `"abc"xyz" "xyz"abc"' \ ///
            `"abc"xyz xyz"abc"' \ ///
            `"abc xyz xyz abc"' \ ///
            `"abc"xyz" "xyz"abc"' \ ///
            `""abc" "xyz" "xyz""abc""' \ ///
            `""abc xyz" "xyz abc""'  ///
            )
  inputs

  t = tokeninit(" ", "", J(1,0,"")) // override quote characters to disallow them.
 
  for(i = 1; i <= length(inputs); i++) {
    tokenset(t, inputs[i])
    res = tokengetall(t)
    "Input {" + inputs[i] + "} yields tokens"
    res'
  }
 
end


Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#6

14 Apr 2021, 17:30

Dear Leonardo, thank you very much for pointing out this option, but I don't want to disregard the quotes completely, as I need each textual label to be treated as a single token.

Your code above produces (fragment):

Code:

Input {abc"xyz xyz"abc} yields tokens
             1
    +-----------+
  1 |  abc"xyz  |
  2 |  xyz"abc  |
    +-----------+

while I expect a single token in this case, as outlined above.
A parser whose logic I would find understandable should:
1. start the token: a+b+c
2. at the opening quote, start ignoring spaces: +"+x+y+z+space+x+y+z+"
3. after the closing quote, start reacting to spaces again: +a+b+c
4. reach the end of the string
5. yield a single token.
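
The steps above can be sketched as follows. This is a hypothetical Python model of the desired behavior, not something tokens() does; desired_tokens and strip_outer are made-up names, and the rule for dropping outer quotes (only when the whole token is exactly one quoted string) is an assumption inferred from the example list in #3.

```python
def strip_outer(tok):
    """Drop the outer quotes only if the token is exactly one quoted string."""
    if len(tok) >= 2 and tok[0] == '"' and tok[-1] == '"' and tok.count('"') == 2:
        return tok[1:-1]
    return tok


def desired_tokens(s):
    """Split on spaces, but ignore spaces while inside double quotes."""
    tokens, current, in_quotes = [], [], False
    for ch in s:
        if ch == '"':
            # A quote toggles the "ignore spaces" state and stays in the token.
            in_quotes = not in_quotes
            current.append(ch)
        elif ch == " " and not in_quotes:
            # A space outside quotes ends the current token, if there is one.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return [strip_outer(t) for t in tokens]


print(desired_tokens('abc"xyz xyz"abc'))      # ['abc"xyz xyz"abc']
print(desired_tokens('abc"xyz" "xyz"abc'))    # ['abc"xyz"', '"xyz"abc']
print(desired_tokens('"abc xyz" "xyz abc"'))  # ['abc xyz', 'xyz abc']
```

With this rule both A"B" and "A"B come out as a single token, so the result does not depend on which side of the token the quoted part sits.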

This is not a big deal. I believe that if the user writes the command with correct syntax (this is related to this post), then it shouldn't matter much. And I am still hopeful that, with the help of the users of this forum, I will be able to avoid manually parsing the options.

Sincerely, Sergiy
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2384
#7

14 Apr 2021, 18:05

Ah I got close then. I think this requires some kind of hierarchical approach to do precisely what you want then, but if it matters a lot, I think this provides some inspiration to do so.
daniel klein

Join Date: Mar 2014

Posts: 3811
#8

15 Apr 2021, 00:51

Having glanced at your new post, I believe we need to clarify. Please show the exact syntax that would produce

{abc"xyz xyz"abc} is 1 token {abc"xyz xyz"abc}
{abc"xyz" "xyz"abc} is 2 tokens {abc"xyz"} {"xyz"abc}
{"abc" "xyz" "xyz""abc"} is 3 tokens {abc} {xyz} {"xyz""abc"}

I get

Code:

. mata tokens(`"abc"xyz xyz"abc"')
             1         2
    +---------------------+
  1 |  abc"xyz   xyz"abc  |
    +---------------------+

. mata tokens(`"abc"xyz" "xyz"abc"')
              1          2          3
    +----------------------------------+
  1 |  abc"xyz"        xyz        abc  |
    +----------------------------------+

. mata tokens(`""abc" "xyz" "xyz""abc""')
         1     2     3     4
    +-------------------------+
  1 |  abc   xyz   xyz   abc  |
    +-------------------------+

which is what I would expect. In all examples above, the outer compound quotes delimit the one argument that is passed to tokens(); these quotes are not part of the argument and get stripped before tokens() even starts parsing. My guess is that you are really fetching the strings directly from Stata locals, as in

Code:

tokens(st_local("lmacname"))

in which case the outer (compound) quotes inside lmacname, if there are any, are passed through to tokens().

Originally posted by Sergiy Radyakin

I'd expect the algorithm to ignore the delimiters if they occur within a quoted content,

I think this is exactly what happens. The delimiters here are spaces, and tokens() does ignore them if they appear inside quoted content. More precisely, tokens() treats them like any other character if they appear inside quoted content. This is evident in

{"abc xyz" "xyz abc"} is 2 tokens {abc xyz} {xyz abc}

Originally posted by Sergiy Radyakin

{A"B"} is one token, but {"A"B} is two tokens.

I like symmetry, too, but this is about parsing characters, not symmetry. As I understand the way tokens() works, it does matter whether we parse from left to right or from right to left. There is a hint at this in the help (emphasis mine)

If s contains quoted material and the quotes do not match, results are as if the appropriate number of close quotes were added to the end of s.

I agree that this behavior is arbitrary but that is the way StataCorp chose to implement and document tokens().

Last edited by daniel klein; 15 Apr 2021, 00:53.
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#9

15 Apr 2021, 05:49

Please show the exact syntax that would produce

{abc"xyz xyz"abc} is 1 token {abc"xyz xyz"abc}
{abc"xyz" "xyz"abc} is 2 tokens {abc"xyz"} {"xyz"abc}
{"abc" "xyz" "xyz""abc"} is 3 tokens {abc} {xyz} {"xyz""abc"}

Dear Daniel, that was a description of the desired behavior by way of example inputs. So I don't have syntax that produces this, as Stata's tokens() produces different results. With the help of the trick that Nick showed in the other thread mentioned above, I can now avoid manually parsing the collection of parameters, something I wanted to avoid precisely because I didn't want to deal with these peculiarities.

Thank you, and best regards, Sergiy