Error parallelising foreach command

Ciaran OFlynn

Join Date: Jul 2018
Posts: 31

28 Feb 2023, 11:25

Hi,

I'm trying to split a master dataset into separate files, one per country. The datasets are large (300 GB) and currently take more than 48 hours to run, so any help speeding up the process would be much appreciated.

Using this data:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str20 bvd_id_number strL main_activity str2 countrycode
"CN9360430024" "Manufacturing" "CN"
"AU072891993"  "Services"      "AU"
"US149668182L" "Manufacturing" "US"
"US133096011L" "Services"      "US"
"CA32531NC"    "Services"      "CA"
end

and this Stata code:

Code:


//create country list
glevelsof countrycode, local(countries)

//timer on

timer on 1

parallel: foreach c of local countries {  
    use overviews.dta, clear
    keep if countrycode == "`c'"
    save `c', replace    
}

timer off 1

timer list 1

I get this error:

Code:

 //timer on
. 
. timer on 1

. 
. parallel: foreach c of local countries {  
--------------------------------------------------------------------------------
Parallel Computing with Stata (by GVY)
Clusters   : 4
pll_id     : rp2wznupm1
Running at : D:\Firmographics\overviews\parallell_test
Randtype   : datetime
Waiting for the clusters to finish...
  -3621
cluster 0004 has exited without error...
  -3621
cluster 0001 has exited without error...
  -3621
cluster 0002 has exited without error...
  -3621
cluster 0003 has exited without error...
--------------------------------------------------------------------------------
Enter -parallel printlog #- to checkout logfiles.
--------------------------------------------------------------------------------
                unlink():  3621  attempt to write read-only file
parallel_recursively_rm():     -  function returned error
        parallel_clean():     -  function returned error
                 <istmt>:     -  function returned error
r(3621);

end of do-file

Thanks

Ciaran


Clyde Schechter

Join Date: Apr 2014

Posts: 29911
#2

28 Feb 2023, 11:49

Well, I think this computation is probably I/O bound and I'm not sure how much you can speed it up. But there are a few things that can be done. For one thing, you are reading in the entire data set over and over again for each country. Second, you are iterating over the levels of countrycode, requiring you to apply an -if countrycode == "`c'"- qualifier to every observation in the complete data set at each iteration. Both of these problems can be overcome by using -runby-.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str20 bvd_id_number strL main_activity str2 countrycode
"CN9360430024" "Manufacturing" "CN"
"AU072891993"  "Services"      "AU"
"US149668182L" "Manufacturing" "US"
"US133096011L" "Services"      "US"
"CA32531NC"    "Services"      "CA"
end

capture program drop one_country
program define one_country
    local c = countrycode[1]
    save `c', replace
    clear
    exit
end

runby one_country, by(countrycode) status

-runby- is written by Robert Picard and me, and is available from SSC. Note: the status option causes -runby- to give you a progress report periodically. It will tell you how many countries have been processed so far, in how much time, and give an estimate of the time remaining.
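If -runby- is not yet installed, a minimal sketch for getting it from SSC and trying the approach on the small example before running it on the full file might look like this. The -verbose- option is an addition for this test run and is not part of the code above; it displays the output from each by-group, which helps when tracking down errors. The example data is repeated so that the block runs on its own.

```stata
* Install -runby- from SSC only if it is not already available
capture which runby
if _rc ssc install runby

* Small example data from post #1
clear
input str20 bvd_id_number strL main_activity str2 countrycode
"CN9360430024" "Manufacturing" "CN"
"AU072891993"  "Services"      "AU"
"US149668182L" "Manufacturing" "US"
"US133096011L" "Services"      "US"
"CA32531NC"    "Services"      "CA"
end

* Program run once per country; only that country's observations are in memory
capture program drop one_country
program define one_country
    local c = countrycode[1]   // country code of the current by-group
    save `c', replace          // writes e.g. CN.dta to the working directory
    clear                      // leave nothing behind for -runby- to accumulate
end

* -verbose- (added for this test run) shows the output from each by-group
runby one_country, by(countrycode) verbose
```

When -runby- finishes, the data in memory is replaced by whatever the program left behind across all groups, which here is nothing, so reload the master file if you need it again.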

As I say, given all the file saving you have to do, I'm not sure how much time can be saved here. But at least you'll only have to read the whole big file once, and you will not have to evaluate any -if- qualifiers. I'm sure the improvement will be noticeable, but it may not be dramatic.
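To see how much time is actually saved on your own data, one option is to reuse the -timer- commands from post #1 around the -runby- version. This is only a sketch: it assumes overviews.dta is in the current working directory, as in post #1, and the timer number 1 is an arbitrary choice.

```stata
* Program run once per country; only that country's observations are in memory
capture program drop one_country
program define one_country
    local c = countrycode[1]   // country code of the current by-group
    save `c', replace          // one output file per country
    clear                      // leave nothing behind for -runby- to accumulate
end

* Time the single read of the master file plus the split
timer clear 1
timer on 1

use overviews.dta, clear       // assumed location: current working directory
runby one_country, by(countrycode) status

timer off 1
timer list 1                   // elapsed seconds, comparable to the loop version
```

Running the original -foreach- loop (without the -parallel- prefix) on the same smaller file with the same timer gives a like-for-like comparison.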
Ciaran OFlynn

Join Date: Jul 2018

Posts: 31
#3

28 Feb 2023, 12:54

Hi Clyde,

I tried this on one of the smaller files just now and the speed increase was dramatic! Thank you!