Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Implementing a nonparametric entropy estimator for symbolic sequences

    Dear Statalisters,
    I wish to implement the nonparametric entropy estimator for symbolic sequences (in my case: textual data) as suggested by Kontoyiannis et al. (1998: "Nonparametric entropy estimation for stationary processes and random fields, with applications to English text"; IEEE Transactions Volume 44, Issue 3 ).

    Roughly speaking, for each position i of a given string S, one can measure the length li of the longest string starting at i that reappears somewhere else in between position 1 and i minus 1 of S. The idea of the estimator is that longer match-lengths signal more redundancy in the data (and therefore: a smaller entropy rate).
    More precisely, for every position i, li corresponds to the length of the longest subsequence that starts at i and does not appear anywhere else between 1 and (i - 1) of S.

    Here is an example for the short string "Mata and Stata":

    position character match_length
    1 M 1
    2 a 1
    3 t 1
    4 a 2
    5 1
    6 a 2
    7 n 1
    8 d 1
    9 2
    10 S 1
    11 t 3
    12 a 3
    13 t 2
    14 a 1

    At position i = 11, the length of the "shortest mismatch" is li = 3, because both "t" and "ta" appear in the subsequent data, but "tat" does not.
    Here is my attempt to implement the estimator: I first read in the data with Stata and export a text file with one character per line:

    Code:
    clear
    set obs 1
    gen str word="Mata and Stata"
    local length= ustrlen(word)
    expand `length'
    gen order_id=_n
    gen char=usubstr(word,order_id,1)
    drop word
    export delimited char using "string.txt", delimiter(tab) novarnames replace

    This is, of course, only an example, in a real case scenario, the strings are much longer (whole books for instance). This is the main problem for me at the moment: while the code below works for short strings (up to 1,000 characters), it gets very slow for longer strings.
    Anyhow, the code above reads in the sequences as one consecutive string and then processes this sequence by placing each character of it in one separate line. The prepared string is then exported. In what follows, I use Mata to calculate the match_lengthes for each character. NB.: this is the first time I work with Mata, just in case you are wondering about anything unusual.

    First of all, I open the string with fopen() twice, one handle for the text that still has to be processed ("lookahead_input") and one handle for the text that has already been processed ("buffer_input").

    Since the buffer is empty for the first character, the match-length li is 1 at postion 1 and the calculation starts at position 2 (fget(lookahead_in) "burns" the first character). I then generate two empty strings, again one for the buffer (at position 2 the buffer consists of the first character of S, while the "lookahead" consists of all remaining characters of the string S.
    After that, I set up a loop that processes each character. I use strmatch(buffer,"*"+lookahead+"*") to determine if there is a match between the buffer and the lookahead, but maybe there are more efficient ways to to this. If no match is found in the buffer, the length of the shortest-mismatch-length li = 1. If, however, a match is found, I first use ftell(lookahead_input) to record the current postion of the lookahead window. After the match-length for the character at position i has been determined, I use fseek(lookahead_input,pos,-1) to return to the initial position to continue the calculation for the next character.

    Here is what the "while" loop does: if a a match is found in the buffer for the first character of the lookahead, li is incremented by one and one additional character is added to the lookahead window. This procedure is executed as long as there is a match between the buffer and the lookahead.
    After the match-lengths are calculated for all characters, both handles are closed and the resulting match-lengthes for each characters are put into Stata with getmata.


    Code:
    mata:
    buffer_input = fopen("string.txt", "r")
    lookahead_input = fopen("string.txt", "r")
             
    fget(lookahead_input)   
    buffer=""
    lookahead=""
    
    stringlength=rows(cat("string.txt")) /* determine the number of characters of the whole string to set up the loop */
    match_length=1 /* since the buffer is empty for the first character, the match_length is 1 at i=1. */
    
    for (i=1; i<=(stringlength-1); i++) {
                   length=1
                   
                   buffer=buffer+fget(buffer_input)
                   lookahead=fget(lookahead_input)
                   pos = ftell(lookahead_input)
                   continue_the_search=1
                   
                    while ((nextchar=fget(lookahead_input))!=J(0,0,"") & (continue_the_search=1)) {
                       if(strmatch(buffer,"*"+lookahead+"*")) {
                       ++length
                       lookahead=lookahead+nextchar
                       continue_the_search=1
                       }
                       else {
                       continue_the_search=0 /* No match is found, so stop the search */
                       }             
                    }
    
    fseek(lookahead_input,pos,-1)                
    match_length=(match_length \ length)               
    }  
    
    fclose(buffer_input)  
    fclose(lookahead_input)
    end
    getmata match_length
    As mentioned above, the code works, but only for short string sequences. So I'm looking for solution that is much faster and that is able to process very long strings in a reasonable amount of time. Maybe there is a clever solution that records the position of a match in the buffer or something like that to minimize the costly search in the whole buffer.

    So if anyone has any ideas, please let me know, I would be very grateful.If you have any questions, just drop me a line...

    Many thanks

    Ali

    . about
    Stata/MP 14.0 for Windows (64-bit x86-64)
    Revision 22 Sep 2015
    Copyright 1985-2015 StataCorp LP

    Total physical memory: 96255544 KB
    Available physical memory: 60919896 KB





  • #2
    PS.: the description of the example should read "At position i = 11, the length of the "shortest mismatch" is li = 3, because both "t" and "ta" appear in the PRECEDING data, but "tat" does not." Sorry for that

    Comment

    Working...
    X