We've got a ton of data and tables stored in text files that we want to write into Stata data files for further processing. I've always assumed Mata would be faster for this (especially accumulating everything in one matrix and then exporting it to Stata once with getmata, rather than looping over many -replace- statements), but in the sandbox code below, saving things in a matrix and then calling getmata turns out to be much slower.
Any thoughts/guidance about a better workflow to speed this up/optimize this via mata? (n.b., I'm very new to mata)
Also, if the bottleneck here is actually our use of -file read-, we are open to other / faster approaches.
Example:
Code:
******************** Creating file with a bunch of strings similar to our external data:
clear all
gen input=""
se tr off
forvalues i=1/10000 {
    set obs `=_N+1'
    replace input="`:word `=trunc(runiform()*26)' of `c(ALPHA)''`:word `=trunc(runiform()*26)' of `c(ALPHA)''`:word `=trunc(runiform()*26)' of `c(ALPHA)''`:word `=trunc(runiform()*26)' of `c(ALPHA)''`:word `=trunc(runiform()*26)' of `c(ALPHA)''" in `i'
}
file open test using test.txt, write replace
forvalues i=1/10000 {
    file write test "`=input[`i']'" _n
}
file close test
type test.txt, lines(10)

******************** Testing original method: dump each line of the file into an observation in the dataset
clear
timer on 1
gen input=""
file open test using "test.txt", read
local runct=0
file read test line
while r(eof)==0 {
    local ++runct
    qui set obs `=_N+1'
    replace input=`"`line'"' in `runct'
    file read test line
}
timer off 1
type test.txt, lines(10)

******************** Testing new method: append to a mata string matrix and then convert into observations
clear
timer on 2
file close _all
file open testm using "test.txt", read
mata: input=`""'
file read testm line
while r(eof)==0 {
    mata: input=input \ `"`line'"'
    file read testm line
}
getmata input
timer off 2
mata: input
desc
timer list 1
timer list 2

******************** Testing for numbers to see if it is the string slowing things down
clear
gen input=.
se tr off
forvalues i=1/1000 {
    set obs `=_N+1'
    replace input=`=trunc(runiform()*26)' in `i'
}
file open test2 using test2.txt, write replace
forvalues i=1/1000 {
    file write test2 "`=input[`i']'" _n
}
file close test2

clear
timer on 3
gen input=.
file open test2 using "test2.txt", read
local runct=0
file read test2 line
while r(eof)==0 {
    local ++runct
    qui set obs `=_N+1'
    replace input=`line' in `runct'
    file read test2 line
}
timer off 3

clear
timer on 4
file open testm2 using "test2.txt", read
local runct=0
file read testm2 line
while r(eof)==0 {
    local ++runct
    if `runct'==1 {
        mata: input=`line'
    }
    if `runct'>1 {
        mata: input=input \ `line'
    }
    file read testm2 line
}
getmata input
timer off 4

timer list 1
timer list 2
timer list 3
timer list 4