  • Code based on 'file read' and 'file write' too slow and inefficient

    The following code (Stata 13.1) counts instances of unique words, line by line, in the file 'bible.txt'. However, it runs too slowly, because it repeats searches for words already encountered, which ought to be skipped. Ideally, I would like every word searched for to be listed in the file 'bibleout.txt', with each unique word appearing only once. The search should also skip common words like 'is, was, the, there', etc., which are listed in the file 'word.txt'. Any idea how to make the code more efficient and compact, and how to skip words previously encountered? Example data are given below the program, followed by a sketch of one possible simplification.

    capture file close myfile
    cd "c:\Users\username\location1"
    file open myfile using "bible.txt", read
    file read myfile line
    local r = r(eof)
    while `r' == 0 {                                    // loop over the lines of bible.txt
        local x : word count `line'
        disp _n (2) "`line'"
        local line : list uniq line                     // keep each word of the line only once
        qui import delimited bibleout.txt, clear        // words already searched for (variable v1)
        foreach w of local line {
            cap assert strpos(v1, trim(itrim(`w'))) == 0 & ///
                strpos(v1, trim(itrim("`w'"))) == 0
            if _rc {                                    // word already in bibleout.txt: drop it
                local line : list line - w
                continue
            }
        }
        local x : word count `line'
        tokenize `"`line'"'
        local s1 (:|,|\.|\;|[0-9])$                     // words ending in punctuation or a digit
        qui import delimited word.txt, clear            // common words to skip (variable v1)
        forval word = 3/`x' {                           // start at the third word
            cap assert strpos(v1, trim(itrim("``word''"))) == 0
            if !_rc & !regexm("``word''","`s1'") {      // not a common word, no trailing punctuation/digits
                capture file close myfile2
                capture file close myline
                file open myfile2 using "bible.txt", read
                file open myline using "bibleout.txt", write append
                file read myfile2 line
                scalar k = 0
                while r(eof)==0 {                       // count occurrences of the word in the whole file
                    local i length("`line'")
                    local p length(subinstr("`line'", "``word''", "", .))
                    local n length("``word''")
                    scalar j = (`i' - `p')/(`n')
                    scalar k = k + j
                    local k = k
                    file read myfile2 line
                }
                disp "The word ``word'' appears `k' times!"
                file write myline "``word''" _n
                continue
            }
        }
        local --r
        file read myfile line
        local r = r(eof)
        continue
    }

    Data Files are given below:

    1. File bible.txt is the source file, shortened here for space

    Genesis 1:1 In the beginning God created the heaven and the earth.
    Genesis 1:2 And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters.
    Genesis 1:3 And God said, Let there be light: and there was light.
    Genesis 1:4 And God saw the light, that it was good: and God divided the light from the darkness.
    Genesis 1:5 And God called the light Day, and the darkness he called Night. And the evening and the morning were the first day.

    2. File bibleout.txt is a list of unique words already searched for, implying that this file grows with more searches
    face
    Spirit
    divided
    darkness
    evening
    first
    firmament
    divide
    evening
    face

    3. File word.txt is a list of common words to be skipped in the search
    is
    when
    in
    the
    there
    were
    he
    from
    which
    under
    above


  • #2
    You could look at this presentation given at this year's German Stata Users' meeting: https://www.stata.com/meeting/german...8_Schonlau.pdf
    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------



    • #3
      Thanks Maarten.



      • #4
        This is a typical application for an associative array. For Stata 13, which you indicated you use, see help mata asarray(); in Stata 15, help mf_associativearray tends to look nicer. In this example I parsed the entire King James Bible in 14 seconds, a large portion of which was spent downloading the file.

        Code:
        clear all
        set rmsg on
        
        mata
        
        hist = asarray_create()                             // here we store our histogram
        asarray_notfound(hist,0)                            // if a word hasn't occurred yet,
                                                            // its frequency is 0
        
        notuse = asarray_create()                           // create a table of "bad" words
        bad = "is when in the there were he from which under above"
        bad = tokens(bad)
        for (i = 1 ; i <= cols(bad) ; i++) {
            asarray(notuse,bad[i], 1)                       // the number 1 is irrelevant;
                                                            // we only check whether that word exists
        }
        
        fh = fopen("http://www.bibleprotector.com/TEXT-PCE-127.txt", "r") // load the entire (Énglish) bible
        while ((line=fget(fh))!=J(0,0,"")) {                //loop over the lines
            line = strlower(line)                           // don't distinguish between lower and upper case
            line = tokens(line)                             // break line into words
            for (i = 3 ; i<= cols(line); i++) {             // first two words are the book abreviatoin and the line
                if ( !asarray_contains(notuse, line[i]) ) { // ignore bad words
                    freq = asarray(hist, line[i]) + 1       // add 1 to that word's frequency
                    asarray(hist, line[i], freq)            // store the new frequency
                }
            }
        }
        fclose(fh)
        
        // now you can do things with this dictionary, e.g.:
        
        // Ok "the" is ignored
        asarray(hist,"the")
        
        // the word "king" happens 1787 times
        asarray(hist,"king")
        
        // we recognized 29,567 distinct words
        asarray_elements(hist)
        
        // see -help mata asarray()- for more options
        end
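
        As a small follow-on sketch (not part of the original post, and assuming the hist array created above is still in Mata's memory): the question in #1 also asked for the distinct words to be written to bibleout.txt, which Mata's fopen() and fput() can do directly.

        Code:
        mata
        // sketch: write every distinct (non-skipped) word counted above to bibleout.txt, one per line
        rc = _unlink("bibleout.txt")                        // remove any old copy; fopen(, "w") must create the file afresh
        out = fopen("bibleout.txt", "w")
        for (loc = asarray_first(hist); loc != NULL; loc = asarray_next(hist, loc)) {
            fput(out, asarray_key(hist, loc))
        }
        fclose(out)
        end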
        Last edited by Maarten Buis; 13 Nov 2018, 03:22.
        ---------------------------------
        Maarten L. Buis
        University of Konstanz
        Department of history and sociology
        box 40
        78457 Konstanz
        Germany
        http://www.maartenbuis.nl
        ---------------------------------



        • #5
          Once I stored the text locally, the algorithm was able to pass through it in 4 seconds. Some more stuff you can do with this:

          Code:
          clear all
          set rmsg on
          
          mata
          
          hist = asarray_create()                             // here we store our histogram
          asarray_notfound(hist,0)                            // if a word hasn't occurred yet,
                                                              // its frequency is 0
          
          notuse = asarray_create()                           // create a table of "bad" words
          bad = "is when in the there were he from which under above and of to that"
          bad = tokens(bad)
          for (i = 1 ; i <= cols(bad) ; i++) {
              asarray(notuse,bad[i], 1)                       // the number 1 is irrelevant;
                                                              // we only check whether that word exists
          }
          
          punct = ", . ; : ! ? ( ) [ ] > <"                   // punctuation we want to remove
          punct = tokens(punct)
          
          fh = fopen("c:\temp\TEXT-PCE-127.txt", "r")         // load the entire (Énglish) bible, now locally
          while ((line=fget(fh))!=J(0,0,"")) {                //loop over the lines
              line = strlower(line)                           // don't distinguish between lower and upper case
              for (j=1; j<= cols(punct) ; j++) {
                  line = subinstr(line,punct[j], "")          // remove punctiation
              }
              line = tokens(line)                             // break line into words
              for (i = 3 ; i<= cols(line); i++) {             // first two words are the book abreviation and the line
                  if ( !asarray_contains(notuse, line[i]) ) { // ignore bad words
                      freq = asarray(hist, line[i]) + 1       // add 1 to that word's frequency
                      asarray(hist, line[i], freq)            // store the new frequency
                  }
              }
          }
          fclose(fh)
          
          // now you can do things with this dictionary, e.g.:
          
          // Ok "the" is ignored
          asarray(hist,"the")
          
          // the word "king" happens 2,256 times
          asarray(hist,"king")
          
          // we recognized 12,832 distinct words
          asarray_elements(hist)
          
          // we counted 567,101 words
          // A quick Google search suggests that the King James Bible contains 783,137 words,
          // but that includes our "bad" words, which apparently appear about 200,000 times.
          count = 0
          for (loc=asarray_first(hist); loc!=NULL; loc=asarray_next(hist, loc)) {
              count = count + asarray_contents(hist, loc)
          }
          count
          
          // The most common non-bad word is "shall", occurring 9,837 times
          largest = ""
          count = 0
          for (loc=asarray_first(hist); loc!=NULL; loc=asarray_next(hist, loc)) {
              if ( asarray_contents(hist, loc) > count ) {
                  count = asarray_contents(hist, loc)
                  largest = asarray_key(hist, loc)
              }
          }
          largest
          count
          
          // export to Stata
          k = asarray_elements(hist)
          word = J(k, 1, "")
          freq = J(k, 1, . )
          i = 0
          for (loc=asarray_first(hist); loc!=NULL; loc=asarray_next(hist, loc)) {
              i = i + 1
              word[i] = asarray_key(hist, loc)
              freq[i] = asarray_contents(hist, loc)
          }
          toadd = k - st_nobs()
          st_addobs(toadd)
          x = st_addvar("str2045","word")
          st_sstore(.,x,word)
          x = st_addvar("int","freq")
          st_store(.,x,freq)
          
          end
          
          // use graphs to display the distribution  
          spikeplot freq,                                            ///
              yscale(log) yscale(range(0.5 4000))                    ///
              xscale(log)                                            ///
              xtitle(number of occurrences in King James Bible)        ///
              ytitle(number of words with that number of occurrences)  ///
              xlab(1 10 100 1000 10000, format(%9.0gc))              ///
              ylab(1 10 100 1000, format(%9.0gc))
          
          // look at the 20 most common words
          sort freq
          list in -20/l
          
          mata
          
          // find the most common combination of two words
          // "shall be" occuring 2,460 times
          histcombo = asarray_create("string",2)              // here we store the counts of two word combinations
          asarray_notfound(histcombo,0)                       // if a combination hasn't occurred yet,
                                                              // its frequency is 0
          
          fh = fopen("c:\temp\TEXT-PCE-127.txt", "r")         // load the entire (Énglish) bible, now locally
          while ((line=fget(fh))!=J(0,0,"")) {                // loop over the lines
              line = strlower(line)                           // don't distinguish between lower and upper case
              for (j=1; j<= cols(punct) ; j++) {
                  line = subinstr(line,punct[j], "")          // remove punctuation
              }    
              line = tokens(line)                             // break line into words
              for (i = 3 ; i < cols(line); i++) {             // don't continue to the very last word
                  if ( !asarray_contains(notuse, line[i]) &
                       !asarray_contains(notuse, line[i+1]) ) { // ignore bad words
                      key = line[i] , line[i+1]                // now we store counts for two word combinations
                      freq = asarray(histcombo, key) + 1       // add 1 to that combination's frequency
                      asarray(histcombo, key, freq)            // store the new frequency
                  }
              }
          }
          
          largest = "", ""
          count = 0
          for (loc=asarray_first(histcombo); loc!=NULL; loc=asarray_next(histcombo, loc)) {
              if ( asarray_contents(histcombo, loc) > count ) {
                  count = asarray_contents(histcombo, loc)
                  largest = asarray_key(histcombo, loc)
              }
          }
          largest
          count
          
          end
          ---------------------------------
          Maarten L. Buis
          University of Konstanz
          Department of history and sociology
          box 40
          78457 Konstanz
          Germany
          http://www.maartenbuis.nl
          ---------------------------------



          • #6
            This one is also interesting:

            Code:
            mata
            asarray(hist, "woman") + asarray(hist, "women")
            asarray(hist, "man") + asarray(hist, "men")
            end
            ---------------------------------
            Maarten L. Buis
            University of Konstanz
            Department of history and sociology
            box 40
            78457 Konstanz
            Germany
            http://www.maartenbuis.nl
            ---------------------------------



            • #7
              Maarten's offering is a tour de force! Here's an alternative approach, which requires only one line in Mata. It takes advantage of the neglected but powerful -fileread()- function. The following accomplished the minimal goal of producing a data set of the distinct words and their frequencies in less than 2 sec. on my garden-variety laptop:
              Code:
              clear
              mata mata clear
              timer clear 1
              timer on 1
              set obs 1  // fileread() requires an observation to hold the whole file
              gen strL s = fileread("c:/temp/bible.txt")    // heavy lifting here
              // original load from:  http://www.bibleprotector.com/TEXT-PCE-127.txt
              //
              // Punctuation, upper case, and end of line markers are a nuisance
              replace s = lower(s)
              foreach c in , . ; : ! ? ( ) [ ] > <  {
                 qui replace s = subinstr(s, "`c'", "", .)
              }
              local eoln = char(13) + char(10) // Windows file format
              quiet replace s = subinstr(s, "`eoln'", " ", .)
              //
              // Use Mata to make each word a row of a Mata vector.
              putmata bible = s   // quick and dirty
              mata word = (tokens(bible))'
              //
              // Back in Stata
              clear
              getmata word = word
              bysort word: gen frequency = _N
              by word: keep if _n == 1  // no duplicates wanted
              timer off 1
              timer list 1
              //
              sort frequency  // for browsing
              // Some examples
              list if word == "woman"
              list if word == "man"
              list if word == "sin"
              I noticed some strange words in the listing, but as near as I can tell, they occur in the original text.
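
              One way to inspect those odd tokens (a quick sketch, assuming the word and frequency variables created above; the pattern allows only lower-case letters and apostrophes, which may be stricter than intended):

              Code:
              gen byte odd = !regexm(word, "^[a-z']+$")     // flag tokens containing unexpected characters
              count if odd
              list word frequency if odd & frequency > 5, clean   // arbitrary cut-off just to keep the listing short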



              • #8
                Maarten and Mike:

                Thank you very much. The suggested code works like a charm. I can't thank you enough.

                Will report back after trying a few tweaks.

