Hi all, I have a question about the -parallel- package, specifically about runtime consistency with -parallel append-.
Our team is using the latest version of -parallel- from the authors’ github. We are running code on two machines, one with 8 cores, one with 16.
Each of our input files gives the all-time workforce for one firm, and we want to run the same cleaning routine for each worker*firm. At the moment we are testing on a toy set of 17 files, but ultimately we will want to run it on >100k files.
We are finding that the runtime on the toy data varies non-trivially between runs on the same machine (e.g. 15 mins vs 20 mins vs 25 mins). Is there any way to make the runtime more consistent, either within -parallel- or by some other means?
For example, I see that -parallel- has options to set seeds, but it is not clear to me whether they help in our use case (as opposed to e.g. bootstrapping). The -parallel- help file gives little detail on the seed option.
Code below.
Code:
*--- install packages ---//
* net install parallel, from(https://raw.github.com/gvegayon/parallel/master/) replace
* ssc install filelist

*--- macros + setup ---//
global intrial [path to data]
mata mata mlib index
parallel clean, all force
parallel initialise

*--- write the programme ---///
prog drop _all
program test, rclass {
    [do some cleaning]
end
}

*--- run it + test runtime ---//
timer clear
timer on 1

* extract file/firm identifier
filelist, dir($intrial) pattern(firm_*.dta) norecur
gen shortname = subinstr(filename, "firm_*.dta", "",.)
isid shortname
gen file_num = subinstr(shortname,".dta","",.)
replace file_num = subinstr(file_num,"firm_","",.)
destring file_num, replace
qui sum file_num, det
local min_file=r(min)
local max_file=r(max)

parallel initialize
parallel append, do(test) prog(test) e("$intrial/firm_%g.dta, `min_file'/`max_file'")

timer off 1
timer list