  • Getting a consistent runtime for parallel append

    Hi all, I have a question about the -parallel- package, and specifically consistency in runtime for parallel append.

    Our team is using the latest version of -parallel- from the authors' GitHub repository. We are running the code on two machines, one with 8 cores and one with 16.

    Each of our input files gives the all-time workforce for one firm, and we want to run the same cleaning routine for each worker*firm. At the moment we are testing on a toy set of 17 files, but ultimately we will want to run it on >100k files.

    We are finding that the runtime on the toy data varies non-trivially between runs on the same machine (e.g. 15 mins vs 20 mins vs 25 mins). Is there any way to make the runtime more consistent, either within -parallel- or some other way?

    For example, I see that -parallel- has options to set seeds, but it is not clear to me whether they help in our use case (as opposed to, e.g., bootstrapping). The -parallel- help file gives little detail on the seed option.

    Code below.

    Code:
    *--- install packages ---//
    * net install parallel, from(https://raw.github.com/gvegayon/parallel/master/) replace
    * ssc install filelist
    
    *--- macros + setup ---//
    global intrial [path to data]
    mata mata mlib index
    parallel clean, all force
    parallel initialize
    
    *--- write the programme ---//
    prog drop _all
    program test, rclass
        [do some cleaning]
    end
    
    *--- run it + test runtime ---//
    
    timer clear
    timer on 1
    
    * extract file/firm identifier
    filelist, dir($intrial)  pattern(firm_*.dta) norecur
    * subinstr() matches literally (no wildcards), so shortname is just the file name
    gen shortname = filename
    isid shortname
    
    gen file_num = subinstr(shortname,".dta","",.)
    replace file_num = subinstr(file_num,"firm_","",.)
    destring file_num, replace
    
    qui sum file_num, det
    local min_file=r(min)   
    local max_file=r(max)
    
    parallel initialize
    parallel append, do(test) prog(test) e("$intrial/firm_%g.dta, `min_file'/`max_file'")
    
    timer off 1
    timer list