  • Code based on 'file read' and 'file write' too slow and inefficient

    The following code (Stata 13.1) counts instances of unique words, line by line, in the file 'bible.txt'. However, it runs too slowly, because it repeats searches for words already encountered, which ought to be skipped. Ideally, I would like every word searched for to be listed in the file 'bibleout.txt', with each unique word appearing only once. The search should also skip common words like 'is, was, the, there', etc., which are listed in the file 'word.txt'. Any idea how to make the code more efficient and compact, and how to skip words previously encountered? Example data are given below the program, followed by a sketch of one possible simplification.

    capture file close myfile
    cd "c:\Users\username\location1"
    file open myfile using "bible.txt", read
    file read myfile line
    local r = r(eof)
    while `r' == 0 {                                    // loop over the lines of bible.txt
        local x : word count `line'
        disp _n (2) "`line'"
        local line : list uniq line                     // keep each word of the line only once
        qui import delimited bibleout.txt, clear        // words already searched for (variable v1)
        foreach w of local line {
            cap assert strpos(v1, trim(itrim(`w'))) == 0 & ///
                strpos(v1, trim(itrim("`w'"))) == 0
            if _rc {                                    // word already in bibleout.txt: drop it
                local line : list line - w
                continue
            }
        }
        local x : word count `line'
        tokenize `"`line'"'
        local s1 (:|,|\.|\;|[0-9])$                     // words ending in punctuation or a digit
        qui import delimited word.txt, clear            // common words to skip (variable v1)
        forval word = 3/`x' {                           // start at the third word
            cap assert strpos(v1, trim(itrim("``word''"))) == 0
            if !_rc & !regexm("``word''","`s1'") {      // not a common word, no trailing punctuation/digits
                capture file close myfile2
                capture file close myline
                file open myfile2 using "bible.txt", read
                file open myline using "bibleout.txt", write append
                file read myfile2 line
                scalar k = 0
                while r(eof)==0 {                       // count occurrences of the word in the whole file
                    local i length("`line'")
                    local p length(subinstr("`line'", "``word''", "", .))
                    local n length("``word''")
                    scalar j = (`i' - `p')/(`n')
                    scalar k = k + j
                    local k = k
                    file read myfile2 line
                }
                disp "The word ``word'' appears `k' times!"
                file write myline "``word''" _n
                continue
            }
        }
        local --r
        file read myfile line
        local r = r(eof)
        continue
    }

    Data Files are given below:

    1. File bible.txt is the source file, shortened here for space

    Genesis 1:1 In the beginning God created the heaven and the earth.
    Genesis 1:2 And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters.
    Genesis 1:3 And God said, Let there be light: and there was light.
    Genesis 1:4 And God saw the light, that it was good: and God divided the light from the darkness.
    Genesis 1:5 And God called the light Day, and the darkness he called Night. And the evening and the morning were the first day.

    2. File bibleout.txt is a list of unique words already searched for, implying that this file grows with more searches
    face
    Spirit
    divided
    darkness
    evening
    first
    firmament
    divide
    evening
    face

    3. File word.txt is a list of common words to be skipped in the search
    is
    when
    in
    the
    there
    were
    he
    from
    which
    under
    above


  • #2
    You could look at this presentation given at this year's German Stata Users' meeting: https://www.stata.com/meeting/german...8_Schonlau.pdf
    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------



    • #3
      Thanks Maarten.



      • #4
        This is a typical application for an associative array. For Stata 13, which you indicated you use, see help mata asarray(); in Stata 15, help mf_associativearray tends to look nicer. In this example I parsed the entire King James Bible in 14 seconds, a large portion of which was spent downloading the file.

        Code:
        clear all
        set rmsg on
        
        mata
        
        hist = asarray_create()                             // here we store our histogram
        asarray_notfound(hist,0)                            // if a word hasn't occurred yet,
                                                            // its frequency is 0
        
        notuse = asarray_create()                           // create a table of "bad" words
        bad = "is when in the there were he from which under above"
        bad = tokens(bad)
        for (i = 1 ; i <= cols(bad) ; i++) {
            asarray(notuse,bad[i], 1)                       // the number 1 is irrelevant;
                                                            // we only check whether that word exists
        }
        
        fh = fopen("http://www.bibleprotector.com/TEXT-PCE-127.txt", "r") // load the entire (Énglish) bible
        while ((line=fget(fh))!=J(0,0,"")) {                //loop over the lines
            line = strlower(line)                           // don't distinguish between lower and upper case
            line = tokens(line)                             // break line into words
            for (i = 3 ; i<= cols(line); i++) {             // first two words are the book abreviatoin and the line
                if ( !asarray_contains(notuse, line[i]) ) { // ignore bad words
                    freq = asarray(hist, line[i]) + 1       // add 1 to that word's frequency
                    asarray(hist, line[i], freq)            // store the new frequency
                }
            }
        }
        fclose(fh)
        
        // now you can do things with this dictionary, e.g.:
        
        // Ok "the" is ignored
        asarray(hist,"the")
        
        // the word "king" happens 1787 times
        asarray(hist,"king")
        
        // we recognized 29,567 distinct words
        asarray_elements(hist)
        
        // see -help mata asarray()- for more options
        end
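
        As a small follow-on sketch (not part of the original post, and assuming the hist array created above is still in Mata's memory): the question in #1 also asked for the distinct words to be written to bibleout.txt, which Mata's fopen() and fput() can do directly.

        Code:
        mata
        // sketch: write every distinct (non-skipped) word counted above to bibleout.txt, one per line
        rc = _unlink("bibleout.txt")                        // remove any old copy; fopen(, "w") must create the file afresh
        out = fopen("bibleout.txt", "w")
        for (loc = asarray_first(hist); loc != NULL; loc = asarray_next(hist, loc)) {
            fput(out, asarray_key(hist, loc))
        }
        fclose(out)
        end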
        Last edited by Maarten Buis; 13 Nov 2018, 03:22.
        ---------------------------------
        Maarten L. Buis
        University of Konstanz
        Department of history and sociology
        box 40
        78457 Konstanz
        Germany
        http://www.maartenbuis.nl
        ---------------------------------



        • #5
          Once I stored the text locally, the algorithm was able to pass through it in 4 seconds. Some more stuff you can do with this:

          Code:
          clear all
          set rmsg on
          
          mata
          
          hist = asarray_create()                             // here we store our histogram
          asarray_notfound(hist,0)                            // if a word hasn't occurred yet,
                                                              // its frequency is 0
          
          notuse = asarray_create()                           // create a table of "bad" words
          bad = "is when in the there were he from which under above and of to that"
          bad = tokens(bad)
          for (i = 1 ; i <= cols(bad) ; i++) {
              asarray(notuse,bad[i], 1)                       // the number 1 is irrelevant;
                                                              // we only check whether that word exists
          }
          
          punct = ", . ; : ! ? ( ) [ ] > <"                   // punctuation we want to remove
          punct = tokens(punct)
          
          fh = fopen("c:\temp\TEXT-PCE-127.txt", "r")         // load the entire (Énglish) bible, now locally
          while ((line=fget(fh))!=J(0,0,"")) {                //loop over the lines
              line = strlower(line)                           // don't distinguish between lower and upper case
              for (j=1; j<= cols(punct) ; j++) {
                  line = subinstr(line,punct[j], "")          // remove punctiation
              }
              line = tokens(line)                             // break line into words
              for (i = 3 ; i<= cols(line); i++) {             // first two words are the book abreviation and the line
                  if ( !asarray_contains(notuse, line[i]) ) { // ignore bad words
                      freq = asarray(hist, line[i]) + 1       // add 1 to that word's frequency
                      asarray(hist, line[i], freq)            // store the new frequency
                  }
              }
          }
          fclose(fh)
          
          // now you can do things with this dictionary, e.g.:
          
          // Ok "the" is ignored
          asarray(hist,"the")
          
          // the word "king" happens 2,256 times
          asarray(hist,"king")
          
          // we recognized 12,832 distinct words
          asarray_elements(hist)
          
          // we counted 567,101 words
          // A quick Google search suggests that the King James Bible contains 783,137 words,
          // but that includes our "bad" words, which apparently appear about 200,000 times.
          count = 0
          for (loc=asarray_first(hist); loc!=NULL; loc=asarray_next(hist, loc)) {
              count = count + asarray_contents(hist, loc)
          }
          count
          
          // The most common non-bad word is "shall", occurring 9,837 times
          largest = ""
          count = 0
          for (loc=asarray_first(hist); loc!=NULL; loc=asarray_next(hist, loc)) {
              if ( asarray_contents(hist, loc) > count ) {
                  count = asarray_contents(hist, loc)
                  largest = asarray_key(hist, loc)
              }
          }
          largest
          count
          
          // export to Stata
          k = asarray_elements(hist)
          word = J(k, 1, "")
          freq = J(k, 1, . )
          i = 0
          for (loc=asarray_first(hist); loc!=NULL; loc=asarray_next(hist, loc)) {
              i = i + 1
              word[i] = asarray_key(hist, loc)
              freq[i] = asarray_contents(hist, loc)
          }
          toadd = k - st_nobs()
          st_addobs(toadd)
          x = st_addvar("str2045","word")
          st_sstore(.,x,word)
          x = st_addvar("int","freq")
          st_store(.,x,freq)
          
          end
          
          // use graphs to display the distribution  
          spikeplot freq,                                            ///
              yscale(log) yscale(range(0.5 4000))                    ///
              xscale(log)                                            ///
              xtitle(number of occurrences in King James Bible)        ///
              ytitle(number of words with that number of occurrences)  ///
              xlab(1 10 100 1000 10000, format(%9.0gc))              ///
              ylab(1 10 100 1000, format(%9.0gc))
          
          // look at the 20 most common words
          sort freq
          list in -20/l
          
          mata
          
          // find the most common combination of two words
          // "shall be" occuring 2,460 times
          histcombo = asarray_create("string",2)              // here we store the counts of two word combinations
          asarray_notfound(histcombo,0)                       // if a combination hasn't occurred yet,
                                                              // its frequency is 0
          
          fh = fopen("c:\temp\TEXT-PCE-127.txt", "r")         // load the entire (Énglish) bible, now locally
          while ((line=fget(fh))!=J(0,0,"")) {                // loop over the lines
              line = strlower(line)                           // don't distinguish between lower and upper case
              for (j=1; j<= cols(punct) ; j++) {
                  line = subinstr(line,punct[j], "")          // remove punctuation
              }    
              line = tokens(line)                             // break line into words
              for (i = 3 ; i < cols(line); i++) {             // don't continue to the very last word
                  if ( !asarray_contains(notuse, line[i]) &
                       !asarray_contains(notuse, line[i+1]) ) { // ignore bad words
                      key = line[i] , line[i+1]                // now we store counts for two word combinations
                      freq = asarray(histcombo, key) + 1       // add 1 to that combination's frequency
                      asarray(histcombo, key, freq)            // store the new frequency
                  }
              }
          }
          
          largest = "", ""
          count = 0
          for (loc=asarray_first(histcombo); loc!=NULL; loc=asarray_next(histcombo, loc)) {
              if ( asarray_contents(histcombo, loc) > count ) {
                  count = asarray_contents(histcombo, loc)
                  largest = asarray_key(histcombo, loc)
              }
          }
          largest
          count
          
          end
          ---------------------------------
          Maarten L. Buis
          University of Konstanz
          Department of history and sociology
          box 40
          78457 Konstanz
          Germany
          http://www.maartenbuis.nl
          ---------------------------------



          • #6
            This one is also interesting:

            Code:
            mata
            asarray(hist, "woman") + asarray(hist, "women")
            asarray(hist, "man") + asarray(hist, "men")
            end
            ---------------------------------
            Maarten L. Buis
            University of Konstanz
            Department of history and sociology
            box 40
            78457 Konstanz
            Germany
            http://www.maartenbuis.nl
            ---------------------------------



            • #7
              Maarten's offering is a tour de force! Here's an alternative approach, which requires only one line in Mata. It takes advantage of the neglected but powerful -fileread()- function. The following accomplished the minimal goal of producing a data set of the distinct words and their frequencies in less than 2 sec. on my garden-variety laptop:
              Code:
              clear
              mata mata clear
              timer clear 1
              timer on 1
              set obs 1  // fileread() requires an observation to hold the whole file
              gen strL s = fileread("c:/temp/bible.txt")    // heavy lifting here
              // original load from:  http://www.bibleprotector.com/TEXT-PCE-127.txt
              //
              // Punctuation, upper case, and end of line markers are a nuisance
              replace s = lower(s)
              foreach c in , . ; : ! ? ( ) [ ] > <  {
                 qui replace s = subinstr(s, "`c'", "", .)
              }
              local eoln = char(13) + char(10) // Windows file format
              quiet replace s = subinstr(s, "`eoln'", " ", .)
              //
              // Use Mata to make each word a row of a Mata vector.
              putmata bible = s   // quick and dirty
              mata word = (tokens(bible))'
              //
              // Back in Stata
              clear
              getmata word = word
              bysort word: gen frequency = _N
              by word: keep if _n == 1  // no duplicates wanted
              timer off 1
              timer list 1
              //
              sort frequency  // for browsing
              // Some examples
              list if word == "woman"
              list if word == "man"
              list if word == "sin"
              I noticed some strange words in the listing, but as near as I can tell, they occur in the original text.
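
              One way to inspect those odd tokens (a quick sketch, assuming the word and frequency variables created above; the pattern allows only lower-case letters and apostrophes, which may be stricter than intended):

              Code:
              gen byte odd = !regexm(word, "^[a-z']+$")     // flag tokens containing unexpected characters
              count if odd
              list word frequency if odd & frequency > 5, clean   // arbitrary cut-off just to keep the listing short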



              • #8
                Maarten and Mike:

                Thank you very much. The suggested code works like a charm. I can't thank you enough.

                Will report back after trying a few tweaks.

