
  • Improving Loop Efficiency for Data Appending in Stata 17

    Hello everyone, I am currently facing an issue regarding the efficiency of a loop operation in Stata 17 for Windows.

    My goal is to append multiple .dta files using a loop. However, I have noticed that the process is considerably slow: `nfile' is 300, and each iteration takes about a minute, so the whole run takes far too long. Below is the relevant code:

    Code:
    forvalues v=2/`nfile' {
       use "${tmp_path}/ind.dta", clear
       append using "${tmp_path}/ctytemp.dta"
       quietly save "${tmp_path}/ctytemp.dta", replace
    }
    Thank you

  • #2
    Well, you are doing a lot of disk thrashing by saving an intermediate result at each iteration. If you take the -quietly save ...- command out of the loop and save just once at the end, the loop will run somewhat faster.

    But, more serious is that this loop isn't even appending different files. It's just appending 300 copies of the one file ${tmp_path}/ind.dta. I don't know what your filenames actually are, but your loop isn't dealing with them at all. For this loop to work properly, the -append using- command needs to refer to a different filename at each iteration; that means the filename in that command should somehow involve `v', as in the sketch below.
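
    For instance, here is a minimal sketch of such a corrected loop. The filenames ind_1.dta through ind_`nfile'.dta are purely hypothetical stand-ins, since I don't know what your files are actually called (and this assumes ${tmp_path} contains no embedded spaces):
    Code:
    use "${tmp_path}/ind_1.dta", clear
    forvalues v = 2/`nfile' {
        append using "${tmp_path}/ind_`v'.dta"    // a different file each iteration
    }
    save "${tmp_path}/ctytemp.dta", replace       // save once, after the loop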

    Finally, you don't really need to loop this append process. You can instead do something modeled on this:
    Code:
    local filelist
    forvalues v = 1/`nfile' {
        local filelist `filelist' some_expression_involving_`v'_that_names_the_`v''th_file
    }
    append using `filelist'
    I believe that will be faster even than the modified loop I referred to in the first paragraph.
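
    To make that concrete, here is a sketch using the same hypothetical ind_`v'.dta filenames as above:
    Code:
    local filelist
    forvalues v = 1/`nfile' {
        local filelist `filelist' "${tmp_path}/ind_`v'.dta"    // accumulate the filenames
    }
    clear
    append using `filelist'                       // one -append- reads every file in the list
    save "${tmp_path}/ctytemp.dta", replace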

    • #3
      I am sorry for omitting one line of code involving `v'. The reason I need to save the file again and again is that I want to create a dataset in which each county has exactly the same industry codes, so I need to append repeatedly. The complete code is as follows:

      Code:
      forvalues v=2/`nfile' {
         use "${tmp_path}/ind.dta", clear
         gen countyct=`v'
         append using "${tmp_path}/ctytemp.dta"
         quietly save "${tmp_path}/ctytemp.dta", replace
      }
      summ countyct
      tab id_ind

      Thank you!

      • #4
        OK, but that code still doesn't append 300 different files. Instead it reads in the same file, ${tmp_path}/ind.dta, repeatedly, 300 times, and each time creates a variable, countyct. The value of countyct counts up from 1 to `nfile' with each iteration of the loop, and the final result is the 300 copies appended together, with the variable countyct indexing the copies.

        If that's what you really want to do, there is a quicker way that avoids reading the same file over and over again.
        Code:
        clear *
        use "${tmp_path}/ind.dta", clear
        gen `c(obs_t)' obs_no = _n            // tag each observation with its original position
        expand `nfile'                        // make `nfile' copies of every observation
        by obs_no, sort: gen countyct = _n    // number the copies 1 through `nfile'
        sort countyct obs_no                  // order by county, preserving original order within
        drop obs_no
        save "${tmp_path}/ctytemp.dta", replace
        This will still take a bit of time, because that ind.dta file must be pretty large or your original code would not have taken as long as it did, and -expand-ing, -sort-ing, and -save-ing it will be bottlenecks as well. But it will be a lot faster than reading and writing it 300 times!
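
        As a quick check (just a sketch, and it assumes id_ind uniquely identifies observations in ind.dta), you can verify that every county now carries the identical set of industry codes:
        Code:
        use "${tmp_path}/ctytemp.dta", clear
        isid countyct id_ind                  // each county-industry pair occurs exactly once
        by countyct, sort: gen n_ind = _N     // number of industry codes in each county
        assert n_ind == n_ind[1]              // every county has the same count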
