  • Re-opening a large dataset vs. preserve/restore inside a loop

    I have a very large dataset (~200 million observations) that I am opening inside a loop. I'm wondering whether it would be faster to open it once and use preserve/restore inside the loop instead. Sample code is below; any help would be greatly appreciated.

    forvalues i = 1/$max {
        use "XXX\large_data.dta", clear
        keep if geocode == "`i'"
        save "XXX\large_data_`i'.dta", replace
    }

    or

    use "XXX\large_data.dta", clear

    forvalues i = 1/$max {
        preserve
        keep if geocode == "`i'"
        save "XXX\large_data_`i'.dta", replace
        restore
    }


  • #2
    You can use timer to find out which code is faster.
    Code:
    help timer
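
    For example, a minimal sketch of how this might look for the two variants in #1 (illustrative only; paste your own code where indicated):
    Code:
    timer clear
    timer on 1
    * ... variant 1: repeated -use- inside the loop ...
    timer off 1
    timer on 2
    * ... variant 2: -preserve-/-restore- inside the loop ...
    timer off 2
    timer list    // compare t(1) and t(2)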



    • #3
      From my understanding, -preserve- does nothing more than (temporarily) save the dataset to your hard drive, and -restore- re-opens that temporary dataset.

      Thus, I would expect your second code block to be slower: it has to (1) use the data once and then, inside the loop, (2) save the dataset and (3) re-open it repeatedly.

      Proof with example data:
      Code:
      tempfile largedata
      
      // create a large dataset
      sysuse auto , clear
      expand 100000
      save `"`largedata'"'
      
      // clear timers
      timer clear
      
      // variant 1: with repeated -use-
      timer on 1
      forvalues num=1/20 {
          use `"`largedata'"' , clear
          keep if make=="AMC Pacer"
          * do something here
          
      }
      timer off 1
      
      // variant 2: with repeated -preserve- and -restore-
      timer on 2
      use `"`largedata'"' , clear
      forvalues num=1/20 {
          preserve
          keep if make=="AMC Pacer"
          * do something here
          restore
      }
      timer off 2
      
      // variant 3: with repeated -use using-
      timer on 3
      forvalues num=1/20 {
          use if make=="AMC Pacer" using `"`largedata'"' , clear
          * do something here
          
      }
      timer off 3
      
      // compare execution times
      timer list
      The only substantial speedup I can imagine is to only read the subset of data you want to process in the first place, as illustrated in the third variant of my code: -use- can be directly combined with -if-, so that Stata does not have to read in the whole dataset from your file system.

      Regards
      Bela

      PS: Crossed with Friedrich Huebler's answer; he is, of course, correct.
      Last edited by Daniel Bela; 06 Sep 2017, 11:26. Reason: added PS



      • #4
        If I recall correctly, there is an earlier thread on this question and it was concluded that saving the data in a tempfile and re-reading that tempfile each time is faster than using -preserve- and -restore-.
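
        For concreteness, a minimal sketch of that tempfile pattern applied to the loop in #1 might look like this (the path, $max, and the geocode comparison are taken from the original post):
        Code:
        use "XXX\large_data.dta", clear
        tempfile full
        save `full'
        forvalues i = 1/$max {
            use `full', clear
            keep if geocode == "`i'"
            save "XXX\large_data_`i'.dta", replace
        }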

        But here's a thought that might be faster still for your particular application. It looks like what you are doing is chopping up the large data set into a series of smaller data sets, each corresponding to a single value of geocode.

        So start by reading the full data set. Sort it on geocode and save it in a tempfile. Then create an index of the observation numbers corresponding to each geocode in the sorted data set. So something like this:

        Code:
        use XXX/large_data, clear
        gen long obs_no = _n
        sort geocode
        tempfile holding
        save `holding'
        by geocode: gen long gcount = _N
        local index
        local pointer 1
        while `pointer' <= _N {
            local index `pointer'
            local pointer = `pointer' + gcount[`pointer']
        }
        Now I would loop over the local macro index, reading in from the tempfile only those observations in the range of the current value of geocode:

        Code:
        tokenize index
        while "`1'" != "" {
            use in `1'/`=`2'-1' using `holding', clear
            local i = geocode[1]
            save XXX/large_data_`i', replace
            macro shift
        }
        The advantage of this is that -in- is much faster than -if- (O(1) vs. O(N)), and you also don't have to read in the whole file each time and then select what to keep: you only read in what you need (though there is some overhead in finding the desired range of observations).

        I haven't actually tried this out, so there may be some typos or I may have end-point errors on the loop handling, but I think this will be your best bet.

        Added: Crossed with #2 and #3.
        Last edited by Clyde Schechter; 06 Sep 2017, 11:37.



        • #5
          I hadn't thought about using the -while- loop in this way; it's really interesting/useful, thanks Clyde! A few minor edits to your code (below) to get it to work:

          Code:
          local index
          local pointer 1
          while `pointer' <= _N {
              local index `index' `pointer'
              local pointer = `pointer' + gcount[`pointer']
          }
          tokenize `"`index'"'
          di `"`index'"'
          while "`1'" != "" {
              use in `1'/`=`2'-1' using `holding', clear
              local i = geocode[1]
              save XXX/large_data_`i', replace
              macro shift
          }
          Eric A. Booth | Senior Director of Research | Far Harbor | Austin TX



          • #6
            -in- doesn't actually save any time. In a test program:


            Code:
            set rmsg on
            set more off
            set obs 1000000000
            generate x=_n
            save stuff,replace
            clear
            use stuff
            clear
            use in 1/1 using stuff
            clear
            use stuff

            The three -use- statements all take about the same amount of time (5.37 to 5.71 seconds) on a small server. The -in- statement isn't even the run with the shortest time. My understanding is that recent .dta files keep some dataset metadata at the end of the file, so the whole file needs to be read anyway. This is really unfortunate, and I wish it would be reversed at some point.

            I realize the read is taking place out of cache, but that doesn't really affect my argument.



            • #7
              Here's what I get:

              Code:
              . clear*
              r; t=0.00 12:37:10
              
              . set rmsg on
              r; t=0.00 12:37:10
              
              . set more off
              r; t=0.00 12:37:10
              
              . set obs 100000000
              number of observations (_N) was 0, now 100,000,000
              r; t=1.06 12:37:11
              
              . generate long x=_n
              r; t=1.99 12:37:13
              
              . save stuff,replace
              file stuff.dta saved
              r; t=2.13 12:37:15
              
              . clear
              r; t=0.00 12:37:15
              
              . use stuff
              r; t=1.88 12:37:17
              
              . clear
              r; t=0.00 12:37:17
              
              . use in 1/1000000 using stuff
              r; t=0.08 12:37:17
              
              . clear
              r; t=0.00 12:37:17
              
              . use stuff
              r; t=1.93 12:37:19
              The full file has 10^8 observations. The full read takes about 1.9 seconds. The chunk read with -in- is 10^6 observations and takes 0.08 seconds to read. If you were breaking up the file into 100 chunks of 10^6 observations each, those would each take 0.08 seconds, so a total of 8 seconds spent reading all 100 chunks. By contrast, if you had to read the entire file 100 times, you would have 190 seconds of reading. That's a huge speedup.

              I evidently changed your test case around, reading in chunks of 10^6 observations instead of 1 observation. But I think that's a more realistic test case. And if you make it 1,000 chunks of 10^5 observations each instead, you get very similar results.
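
              For instance, a hedged sketch of that 100-chunk read (not from the original post; it assumes the stuff.dta file with 10^8 observations created above):
              Code:
              timer clear 1
              timer on 1
              forvalues c = 1/100 {
                  local first = (`c' - 1) * 1000000 + 1
                  local last = `c' * 1000000
                  use in `first'/`last' using stuff, clear
              }
              timer off 1
              timer list 1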
              Last edited by Clyde Schechter; 24 Sep 2017, 15:00. Reason: Correct typos.



              • #8
                A few notes:
                • For some reason, I don't see the speedups described by Clyde; my runtime is the same whether I "use in 1/10" or use the whole thing (on Stata 14.2 for Windows).
                • Nonetheless, the suggestion about sorting and then loading by chunks makes a lot of sense, except that Stata's implementation of sort is O(N log N), which removes all the time savings of using an O(1) operation instead of an O(N) operation.
                Now, I'm not sure what MB ultimately wanted to do, but one option is to halve the dataset, in a depth-first algorithm:
                1. Start with a dataset with N obs that you want to split in 16
                2. Use the dataset, keep the first half
                3. Use the first half, then keep the first half
                4. Repeat until you reach the desired size, and back up to the other halves
                In this example, you would only end up -use-ing roughly 2x the data of the original dataset, instead of 16x if you use/keep/save.
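
                Here is a rough sketch of that depth-first halving (illustrative only, not from the original post: the program name halve, the chunk file names, and the use of -describe using- to get the observation count without loading the file are all assumptions; `depth' halvings produce 2^depth chunks):
                Code:
                capture program drop halve
                program define halve
                    args infile depth
                    if (`depth' == 0) exit                    // reached the desired chunk size
                    quietly describe using "`infile'.dta"     // observation count without loading the data
                    local N = r(N)
                    local half = ceil(`N'/2)
                    use "`infile'.dta", clear
                    keep in 1/`half'                          // first half
                    save "`infile'_a.dta", replace
                    use "`infile'.dta", clear
                    keep in `=`half'+1'/`N'                   // second half
                    save "`infile'_b.dta", replace
                    halve "`infile'_a" `=`depth'-1'           // descend into each half, depth-first
                    halve "`infile'_b" `=`depth'-1'
                end
                For example, -halve large_data 4- would split large_data.dta into 16 chunks.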



                • #9
                  1/1 or 1/1000000 doesn't make any difference in my runs. I am using Stata 14.2 for my tests - could it be that Stata has fixed this problem in version 15? Are you using 15? I'd still like to see row selection on the save statement - a lot simpler than "halve the dataset, in a depth-first algorithm".



                  • #10
                    Yes, I am running 15. But my understanding is that the data set structure has not changed between versions 14 and 15.

                    With regard to #8, the logic of the code in #4 is based on replacing -if- with -in-. -if- is an O(N) operation, and -in- is O(1). If you use -if- to partition a data set of N observations into C chunks, you are doing C O(N) operations. To use -in-, you pay a one-time toll of O(N log N), in return for the ability to then complete the task with C O(1) operations. Depending on C and N, that can result in a substantial reduction in time. Of course, for other configurations of C and N, the sort time will outweigh the savings. What I can tell you from my experience using this technique is that, at least for the C's and N's that typically come up, it's usually a very substantial gain. Also note that often in these situations, the data have already been sorted in order to do something else. In that case, with the O(N log N) toll prepaid, it's pure gain. Your mileage may vary.


                    That said, I do agree that being able to select observations with -if- or -in- in the -save- statement would be desirable.

                    Added: It also dawns on me that Sergio Correia (who wrote #8 above) has an -fsort- command that can do the sorts in O(N) time. So using that, in combination with replacing -if- with -in-, should give a spectacular savings in time.
                    Last edited by Clyde Schechter; 24 Sep 2017, 17:07.



                    • #11
                      Interesting addition to #10. I decided to run
                      Code:
                      clear
                      set rmsg on
                      use stuff
                      clear
                      use in 1/100000 using stuff
                      clear
                      use stuff
                      on my Stata 14.2. Like feenberg and Sergio Correia, I find that it takes just as long to read the 1/1000000 segment of the data set as it does to read the whole thing! But I used the same data set that I had used in #7. So it is not about the data set. It is about how -use- works in these two versions of Stata.



                      • #12
                        Interesting; I haven't heard of any instances of Stata 15 being faster than Stata 14 on any commonly used program until this case.

                        By the way, I ran a timer on a modified version of fsort. This is only faster than plain -sort- if you have lots of observations (20 million or above), mostly due to the extra overhead of Mata compared to C. However, it also saves the ranges that you need to pass to -in- in a matrix, which is useful. Code below:

                        #### partition.ado
                        #### (requires ftools)

                        Code:
                        cap pr drop partition
                        program define partition, rclass
                            syntax varlist, [Verbose]
                            loc verbose = ("`verbose'" != "")
                            mata: F = factor("`varlist'", "`touse'", `verbose', "", ., ., ., 0)
                            mata: F.panelsetup()
                            
                            loc sortedby : sortedby
                            if ("`sortedby'" != "`varlist'") {
                                qui ds, has(type numeric)
                                loc numvars `r(varlist)'
                                
                                qui ds, has(type string)
                                loc strvars `r(varlist)'
                                
                                if ("`numvars'" != "") {
                                        mata: st_view(data = ., ., "`numvars'")
                                        mata: data[.,.] = F.sort(data)
                                }
                        
                                if ("`strvars'" != "") {
                                        mata: st_sview(data = ., ., "`strvars'")
                                        mata: data[.,.] = F.sort(data)
                                }
                            }
                            
                            tempname num_levels info
                            mata: st_numscalar("`num_levels'", F.num_levels)
                            mata: st_matrix("`info'", F.info)
                            mata: printf("{txt}(dataset sorted into %g partitions)\n", F.num_levels)
                            mata: mata drop F
                            return scalar num_levels = `num_levels'
                            return matrix info = `info'
                        end
                        
                        ftools, check
                        exit
                        #### example

                        Code:
                        sysuse auto, clear
                        
                        partition turn
                        return list
                        matrix list r(info)
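
                        A hedged sketch of how the returned matrix might then be used for the chunking task discussed above (not from the original post; it assumes each row of r(info) holds the first and last observation numbers of one group in the sorted data):
                        Code:
                        sysuse auto, clear
                        partition turn
                        local G = r(num_levels)
                        matrix info = r(info)
                        
                        tempfile sorted
                        save `sorted'
                        
                        forvalues g = 1/`G' {
                            use in `=info[`g',1]'/`=info[`g',2]' using `sorted', clear
                            * ... process or -save- this chunk here ...
                        }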

