  • token function is strange

    Hi
    I was doing some coding when I found the following Mata curiosity:
    Code:
    : t1 = "This is a test"
    : t2 = subinstr(t1, " ", char(9), .)
    : t2
      This09is09a09test
    : tokens(t1)    // Here space isn't in the output vector, which is what I want
              1      2      3      4
        +-----------------------------+
      1 |  This     is      a   test  |
        +-----------------------------+
    : tokens(t2, char(9))    // Here for some reason char(9)/tab is in the output vector
              1      2      3      4      5      6      7
        +--------------------------------------------------+
      1 |  This     09     is     09      a     09   test  |
        +--------------------------------------------------+
    : select(dummy=tokens(t2, char(9)), dummy :!= char(9))    // So I have to do something like this to get the proper output
              1      2      3      4
        +-----------------------------+
      1 |  This     is      a   test  |
        +-----------------------------+
    Is this a bug?
    Or is there some good reason for this?
    And is there a smarter way of getting the "proper" output than my select?
    Kind regards

    nhb

  • #2
    I think I have the solution (maybe you found it too in the meantime):

    Code:
    t = tokeninit(char(9))
    tokenset(t, t2)
    tokengetall(t)
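
    With the t2 from #1, this should then give (a minimal sketch of the expected result; tab is the wchar here and so is discarded):

    Code:
    tokengetall(t)    // expected: ("This", "is", "a", "test") -- the tabs are gone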

    • #3
      Hi Christophe
      Thank you very much.
      I'll try that in my code.

      But I still think there is something wrong with the tokens function!
      Kind regards

      nhb

      • #4
        I don't think there is a bug in tokens(). When you use tokeninit(), you have to provide the wchars, i.e. the delimiters that identify white space, and then the pchars, i.e. the parsing characters, which are themselves returned as tokens. In tokens() you provide only the pchars (or parsechars).
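
        For instance (a minimal sketch; the ";" delimiter is just an illustration):

        Code:
        tokens("a;b", ";")    // ";" is a pchar: returns ("a", ";", "b")

        t = tokeninit(";")    // ";" is a wchar: it is discarded
        tokenset(t, "a;b")
        tokengetall(t)        // returns ("a", "b")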

        • #5
          Hi Christophe
          You're right as far as tokeninit() etc. are concerned.
          I still think that -tokens-, being a simple function, should treat pchars the same as wchars, like a simple split function would.
          But maybe it is just me thinking so.
          Kind regards

          nhb

          • #6
            I have to agree with Niels that tokens() behaves in a counter-intuitive manner, especially if you have prior experience with split(), with import delimited, or with parsing routines in other programs (e.g. Perl, Excel). In all those cases, the character specified as the delimiter is discarded from the parsed results.

            This behavior isn't described in the [M-5] documentation for tokens(). If it were intuitive, it wouldn't need documenting; since it's uncommon, it should be documented.
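
            For comparison, a minimal Stata sketch (a hypothetical one-observation example) of the split behavior described above:

            Code:
            clear
            set obs 1
            generate str20 s = "this;is;a;test"
            split s, parse(;)
            // s1="this", s2="is", s3="a", s4="test" -- the ";" itself is discarded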

            • #7
              Hi
              I can't even get the solution from Christophe to work properly when, e.g., parsing csv files (I know there are Stata functions for that).
              So in the end I had to build my own.
              For other lost souls, I leave my solution here:
              Code:
                  string rowvector tokensplit(string scalar txt, string scalar delimiter)
                  {
                      string rowvector row
                      string scalar    filter

                      row = J(1, 0, "")
                      // note: delimiter is dropped into a regular expression, so
                      // regex metacharacters in it would need escaping
                      filter = sprintf("(.*)%s(.*)", delimiter)
                      // the greedy (.*) matches up to the last delimiter, so each
                      // pass peels off the final field; empty fields are preserved
                      while (regexm(txt, filter)) {
                          txt = regexs(1)
                          row = regexs(2), row
                      }
                      row = txt, row
                      return(row)
                  }
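
              A quick check with a hypothetical input shows the empty fields surviving:

              Code:
                  tokensplit("a;;b;", ";")    // expected: ("a", "", "b", "")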
              Have fun
              Kind regards

              nhb

              • #8
                Christophe's solution should read

                Code:
                t = tokeninit((" "+char(9)))
                treating tab and (white) space as white-space characters.

                Terminology is consistent within Mata, i.e., between tokens() and tokenget(), but it does differ between Mata and Stata functions. In Mata, wchars (white-space characters) are different from pchars (parsing characters), in the way documented in tokenget(). In Stata, what is called pchars, e.g. in gettoken, corresponds to Mata's concept of wchars. This is indeed unfortunate.
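
                A minimal sketch of the wchars/pchars distinction, using tokeninit()'s first two arguments (the ";" is just an illustration):

                Code:
                t = tokeninit(" ", ";")    // wchars = blank (discarded), pchars = ";" (returned)
                tokenset(t, "a; b c")
                tokengetall(t)             // returns ("a", ";", "b", "c")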

                If Niels is interested in more complex parsing, then reading through tokenget() will be worth the effort. A rewrite of his function could read something like

                Code:
                string rowvector mytokensplit(string scalar txt, string scalar delimiter)
                {
                    transmorphic scalar t

                    t = tokeninit((" " + delimiter))    // blank and delimiter are both wchars
                    tokenset(t, txt)
                    return(tokengetall(t))
                }
                Best
                Daniel
                Last edited by daniel klein; 28 Oct 2015, 12:46.

                • #9
                  Hi Daniel
                  Thank you very much for your answer.

                  I'm sorry for not specifying the problem I meant to solve with my function.
                  The problem has to do with empty fields in a row.

                  The string scalar txt is a typical csv line in which every even field is empty/blank.
                  Comparing your function and mine gives:
                  Code:
                  :         txt = "this is;; a test;"
                  
                  :         mytokensplit(txt, ";")
                            1      2      3      4
                      +-----------------------------+
                    1 |  this     is      a   test  |
                      +-----------------------------+
                  
                  :         tokensplit(txt, ";")
                               1         2         3         4
                      +-----------------------------------------+
                    1 |  this is              a test            |
                      +-----------------------------------------+
                  As you can see, it is my function (tokensplit) that does the job.

                  It might be that tokeninit/tokenset can handle this, but it is not intuitive how.
                  So I have given up on getting something useful out of tokeninit/tokenset,
                  especially when I can write something better within half an hour.
                  Kind regards

                  nhb
