
  • Improving Loop Efficiency for Data Appending in Stata 17

    Hello everyone, I am currently facing an issue regarding the efficiency of a loop operation in Stata 17 for Windows.

    My goal is to append multiple .dta files using a loop. However, I have noticed that the process is considerably slow: `nfile' is 300, and each iteration takes about a minute, so the whole run takes far too long. Below is the relevant code:

    Code:
    forvalues v=2/`nfile' {
       use "${tmp_path}/ind.dta", clear
       append using "${tmp_path}/ctytemp.dta"
       quietly save "${tmp_path}/ctytemp.dta", replace
    }
    Thank you

  • #2
    Well, you are doing a lot of disk thrashing by saving an intermediate result at each iteration. If you take the -quietly save ...- command out of the loop and save just once at the end, the loop will run somewhat faster.

    But, more serious is that this loop isn't even appending different files. It's just appending 300 copies of the one file ${tmp_path}/ind.dta. I don't know what your filenames actually are, but your loop isn't dealing with them at all. For this loop to work properly, the -append using- command needs to refer to a different filename at each iteration; that means the filename in that command should somehow involve `v', as in the sketch below.
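
    For instance, here is a minimal sketch of such a corrected loop. The filenames ind_1.dta through ind_`nfile'.dta are purely hypothetical stand-ins, since I don't know what your files are actually called (and this assumes ${tmp_path} contains no embedded spaces):
    Code:
    use "${tmp_path}/ind_1.dta", clear
    forvalues v = 2/`nfile' {
        append using "${tmp_path}/ind_`v'.dta"    // a different file each iteration
    }
    save "${tmp_path}/ctytemp.dta", replace       // save once, after the loop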

    Finally, you don't really need to loop this append process. You can instead do something modeled on this:
    Code:
    local filelist
    forvalues v = 1/`nfile' {
        local filelist `filelist' some_expression_involving_`v'_that_names_the_`v''th_file
    }
    append using `filelist'
    I believe that will be faster even than the modified loop I referred to in the first paragraph.
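
    To make that concrete, here is a sketch using the same hypothetical ind_`v'.dta filenames as above:
    Code:
    local filelist
    forvalues v = 1/`nfile' {
        local filelist `filelist' "${tmp_path}/ind_`v'.dta"    // accumulate the filenames
    }
    clear
    append using `filelist'                       // one -append- reads every file in the list
    save "${tmp_path}/ctytemp.dta", replace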

    • #3
      I am sorry for omitting one line of code involving `v'. The reason I need to save the file again and again is that I want to create a dataset in which each county has exactly the same industry codes, so I need to append repeatedly. The complete code is as follows:

      Code:
      forvalues v=2/`nfile' {
         use "${tmp_path}/ind.dta", clear
         gen countyct=`v'
         append using "${tmp_path}/ctytemp.dta"
         quietly save "${tmp_path}/ctytemp.dta", replace
      }
      summ countyct
      tab id_ind

      Thank you!

      • #4
        OK, but that code still doesn't append 300 different files. Instead it reads in the same file, ${tmp_path}/ind.dta, repeatedly, 300 times, and each time creates a variable, countyct. The value of countyct counts up from 1 to `nfile' with each iteration of the loop, and the final result is the 300 copies appended together, with the variable countyct indexing the copies.

        If that's what you really want to do, there is a quicker way that avoids reading the same file over and over again.
        Code:
        clear *
        use "${tmp_path}/ind.dta", clear
        gen `c(obs_t)' obs_no = _n            // tag each observation with its original position
        expand `nfile'                        // make `nfile' copies of every observation
        by obs_no, sort: gen countyct = _n    // number the copies 1 through `nfile'
        sort countyct obs_no                  // order by county, preserving original order within
        drop obs_no
        save "${tmp_path}/ctytemp.dta", replace
        This will still take a bit of time, because that ind.dta file must be pretty large or your original code would not have taken as long as it did, and -expand-ing, -sort-ing, and -save-ing it will be bottlenecks as well. But it will be a lot faster than reading and writing it 300 times!
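
        As a quick check (just a sketch, and it assumes id_ind uniquely identifies observations in ind.dta), you can verify that every county now carries the identical set of industry codes:
        Code:
        use "${tmp_path}/ctytemp.dta", clear
        isid countyct id_ind                  // each county-industry pair occurs exactly once
        by countyct, sort: gen n_ind = _N     // number of industry codes in each county
        assert n_ind == n_ind[1]              // every county has the same count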
