  • Speed of accessing/inputting from external data using -replace- vs. Mata

    We have a large amount of data and tables stored in text files that we want to write into Stata data files for further processing. I've always assumed Mata would be faster for this, especially if everything is saved into one matrix and then exported to Stata once (with getmata) rather than through a loop of many -replace- statements. But in testing the sandbox code below, building the matrix and then calling getmata turns out to be much slower.

    Any thoughts or guidance on a better workflow to speed this up or optimize it via Mata? (N.b., I'm very new to Mata.)
    Also, if the bottleneck here is actually that we are using -file read-, we are open to other, faster approaches.


    Example:

    Code:
    ********************Creating file with a bunch of strings similar to our external data:
    clear all
    gen input=""
    se tr off
    forvalues i=1/10000 {
        set obs `=_N+1'
        replace input="`:word `=trunc(runiform()*26)' of `c(ALPHA)''`:word `=trunc(runiform()*26)' of `c(ALPHA)''`:word `=trunc(runiform()*26)' of `c(ALPHA)''`:word `=trunc(runiform()*26)' of `c(ALPHA)''`:word `=trunc(runiform()*26)' of `c(ALPHA)''" in `i'
    }
    
    file open test using test.txt, write replace
    
    forvalues i=1/10000 {
        file write test "`=input[`i']'" _n
    }
    
    file close test
    type test.txt, lines(10)
    
    ********************Testing original method dump each line of the file into an observation in the dataset
    clear
    timer on 1
    gen input=""
    file open test using "test.txt", read
    local runct=0
    file read test line
    while r(eof)==0 {
        local ++runct
        qui set obs `=_N+1'
        replace input=`"`line'"' in `runct'
        file read test line
    }
    timer off 1
    type test.txt, lines(10)
    
    ********************Testing new method append to a mata string matrix and then convert into observations
    clear
    timer on 2
    file close _all
    file open testm using "test.txt", read
    mata: input=`""'
    file read testm line
    while r(eof)==0 {
        mata: input=input \ `"`line'"'
        file read testm line
    }
    getmata input
    timer off 2
    
    mata:    input
    
    desc
    
    timer list 1
    timer list 2
    
    
    ******************************Testing for numbers to see if it is the string slowing things down*******************************************
    
    
    clear
    gen input=.
    se tr off
    forvalues i=1/1000 {
        set obs `=_N+1'
        replace input=`=trunc(runiform()*26)' in `i'
    }
    
    file open test2 using test2.txt, write replace
    
    forvalues i=1/1000 {
        file write test2 "`=input[`i']'" _n
    }
    
    file close test2
    
    clear
    timer on 3
    gen input=.
    file open test2 using "test2.txt", read
    local runct=0
    file read test2 line
    while r(eof)==0 {
        local ++runct
        qui set obs `=_N+1'
        replace input=`line' in `runct'
        file read test2 line
    }
    timer off 3
    clear
    timer on 4
    file open testm2 using "test2.txt", read
    local runct=0
    file read testm2 line
    while r(eof)==0 {
        local ++runct
        if `runct'==1 {
            mata: input=`line'
        }
        if `runct'>1 {
            mata: input=input \ `line'
        }
        file read testm2 line
    }
    getmata input
    timer off 4
    
    timer list 1
    timer list 2
     
    timer list 3
    timer list 4
    Last edited by eric_a_booth; 18 Oct 2017, 19:36.
    Eric A. Booth | Senior Director of Research | Far Harbor | Austin TX

  • #2
    A few things:
    • The first part of the code sample fails on my computer (Stata 14); I'm not sure why (somewhere in the big -replace- line an "invalid syntax" error is produced).
    • If you want speed, it's a *very bad* idea to resize things dynamically. This applies both to matrices (x = x \ y) and to datasets (set obs n+1). Try to avoid resizing as much as possible; for instance, you can grow the dataset in steps of 1000 or so obs instead of one at a time.
    • I think I must be missing something, but for what you are doing, why not just use "insheet" or "import delimited" or even "infile"?
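
    A minimal sketch of the block-resizing idea from the second point, assuming the same test.txt as in #1. The block size of 1000 and the temporary handle name are illustrative choices, not something measured here:

    ```stata
    clear
    gen input = ""
    local block = 1000          // assumed chunk size; tune as needed
    local runct = 0
    tempname fh                 // temporary file handle name
    file open `fh' using "test.txt", read
    file read `fh' line
    while r(eof)==0 {
        local ++runct
        * grow the dataset only when the current block is full
        if `runct' > _N {
            quietly set obs `=_N + `block''
        }
        quietly replace input = `"`line'"' in `runct'
        file read `fh' line
    }
    file close `fh'
    * drop the unused tail of the last block
    if `runct' > 0 {
        quietly keep in 1/`runct'
    }
    ```

    This calls -set obs- once per 1000 lines instead of once per line; the per-line -replace- is still there, so it only removes the resizing cost.
    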


    • #3
      I think the main reason why the second method is slow is the line

      Code:
       mata: input = input \ `"`line'"'
      Appending a new line takes time, as Mata needs to get more memory and then copy the vector and the new line to the new location. In your case, assuming all you need is to read a text file and save it as a Stata file, I would try

      Code:
      clear
      mata: lines = cat("test.txt")
      getmata lines
      On my machine the timer shows 0.02 seconds for this method, compared to 0.31 and 4.34 for the other two. You can't beat the built-in cat().

      The getmata command is intended for interactive use and do-files. If you plan to write an ado-file, you may want to look into st_sstore() instead (the extra s is for string).
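
      A rough sketch of what the st_sstore() route might look like, assuming test.txt from #1 and a new variable called input. It also assumes no line is longer than 2045 characters, so a str# type is enough:

      ```stata
      clear
      mata:
      lines = cat("test.txt")                 // string colvector, one row per line
      st_addobs(rows(lines))                  // size the dataset once
      len = max((1 \ strlen(lines)))          // widest line, at least 1 for str1
      idx = st_addvar("str" + strofreal(len), "input")
      st_sstore(., idx, lines)                // copy all rows into the new variable
      end
      ```

      Here the observations and the string variable are each created in a single step, so nothing is resized inside a loop.
      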

      Edit: Sergio, I see we coincide again! I confirm the syntax error message. I agree there are also non-Mata solutions such as infile.
      Last edited by German Rodriguez; 18 Oct 2017, 21:35.


      • #4
        Thanks, Sergio and German. The cat() function German mentions is exactly what we were looking for. The example is a sandbox version of another problem where -infile-/-insheet-/etc. don't fit, but I agree that those are generally the way to go if we were just importing data like this. Thanks for the quick help!

        Regarding the syntax error: the part that says runiform()*26 will occasionally evaluate to zero (0), and there is no word zero of `c(ALPHA)'. If I change it to 1+runiform()*26, it runs without error.
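
        For reference, a sketch of the corrected index, followed by a vectorized alternative that builds the same kind of five-letter strings without the extended macro functions. The second part uses char() and is a swapped-in technique, not the code from #1:

        ```stata
        * corrected index: 1 + trunc(...) is always between 1 and 26
        display "`:word `=1+trunc(runiform()*26)' of `c(ALPHA)''"

        * alternative: create all observations at once, then append one
        * random capital letter (ASCII 65-90) per pass
        clear
        set obs 10000
        gen input = ""
        forvalues k = 1/5 {
            quietly replace input = input + char(65 + floor(26*runiform()))
        }
        list in 1/5
        ```
        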

        Thanks again!
        Eric A. Booth | Senior Director of Research | Far Harbor | Austin TX
