  • Why is -preserve/restore- so slow, when in the same time -postfile- is so fast?

    Good afternoon,

    From looking at code written by Stata Corp and experienced Stata programmers, I see that the usual style is to restrict the calculation to the subsample of interest by creating a touse variable, and then to end every statement in the programme with " if `touse' ".

    To me this looks awkward and unnatural, and my intuition also suggests that such code should be very slow to execute. To me the natural way of restricting to the subsample of interest is to start with
    -preserve- (keep a copy of the original data)
    -keep if in- (drop what we do not need for our calculations)
    do all the calculations without any conditions, no ifs, no ins.
    -restore- (put the data in the state we found it).

    Apart from the awkwardness of putting " if `touse' " in every statement, I expected my plan to execute faster because
    1. we scan the data only once for the relevant subsample, keep it, and we are done. With the current style, Stata has to evaluate the condition on every line of the programme, which sounds like a lot of work.
    2. once we drop what we do not need at the beginning of the programme, we have a smaller dataset to operate on, which should make things easier and faster for Stata.
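
    For concreteness, the two styles being compared might look like the following minimal sketch. The program names mymean_touse and mymean_keep are hypothetical and only illustrate the pattern; neither is taken from StataCorp's code.

    Code:
    * Style A: mark the sample once, then qualify each command with if `touse'
    program define mymean_touse, rclass
        syntax varname [if] [in]
        marksample touse                // 1 where if/in hold and varname is nonmissing
        summarize `varlist' if `touse', meanonly
        return scalar mean = r(mean)
    end

    * Style B: drop everything outside the sample, then work unconditionally
    program define mymean_keep, rclass
        syntax varname [if] [in]
        marksample touse
        preserve                        // keep a copy of the caller's data
        quietly keep if `touse'
        summarize `varlist', meanonly
        return scalar mean = r(mean)
        restore                         // put the data back as we found it
    end

    After -sysuse auto, clear-, calling either program as "mymean_touse rep78 if foreign == 1" or "mymean_keep rep78 if foreign == 1" should leave the same value in r(mean).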

    And yet I am empirically wrong. The preserve/restore construct is very slow, and the current awkward style beats it in speed.

    Why is that?

    My second, related question is: how is it possible that the -postfile- facility is so fast when the preserve/restore facility is so slow?

  • #2
    preserve writes the data temporarily to disk, and writing to and reading from disk is slow. As an alternative you could use frames, which avoid writing to disk. This speeds things up, but a plain if condition is still faster:

    Code:
    sysuse auto, clear
    timer clear

    forvalues i = 1/1000 {
        * timer 1: preserve, keep the subsample, restore
        timer on 1
        preserve
        qui keep if foreign == 1
        sum rep78, meanonly
        restore
        timer off 1

        * timer 2: copy the subsample into a temporary frame
        timer on 2
        tempname touse
        frame put if foreign == 1, into(`touse')
        frame change `touse'
        sum rep78, meanonly
        frame change default
        frame drop `touse'
        timer off 2

        * timer 3: if qualifier on the full data
        timer on 3
        sum rep78 if foreign == 1, meanonly
        timer off 3
    }

    * reported timings (total seconds / runs = seconds per run):
    timer list
       1:      1.27 /     1000 =       0.0013
       2:      0.29 /     1000 =       0.0003
       3:      0.04 /     1000 =       0.0000
    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------


    • #3
      According to the documentation, preserve now uses frames, at least in Stata/MP.*

      One of the problems with preserve is that once the original dataset is restored, useful results, e.g., e(sample) or predicted values, are wiped out. More generally, keeping track of which observations and variables were used in the calculations seems harder when the calculations are performed on a (potentially entirely) different dataset. Such considerations might be part of the reasons to choose the if touse approach. But I guess the main consideration usually is speed.
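
      A minimal sketch of this bookkeeping problem, using auto.dta as a stand-in. Anything created between preserve and restore, such as predicted values, is gone once the original data come back, whereas the touse approach leaves it attached to the full dataset. The variable names yhat and touse are illustrative assumptions.

      Code:
      sysuse auto, clear

      * preserve/keep: predictions exist only in the temporary subsample
      preserve
      quietly keep if foreign == 1
      quietly regress price weight
      predict double yhat
      restore
      capture confirm variable yhat
      display _rc                       // nonzero: yhat did not survive restore

      * touse: same regression, results stay in the original data
      generate byte touse = foreign == 1
      quietly regress price weight if touse
      predict double yhat if e(sample)
      count if e(sample)                // observations actually used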

      I have no idea why postfile is comparatively fast, or whether it has always been fast (prior to frames). It is built-in, meaning compiled C code, but whether that alone explains the speed, I cannot tell.


      * I wonder whether StataCorp uses this purely to market MP or whether there is a technical reason why preserve cannot use frames in BE and SE.


      • #4
        Thank you, Maarten Buis, for the very useful code. I considered using frames and tentatively decided against it because
        1. my programme would then not be usable in versions earlier than Stata 16.
        2. frames are not that fast; for example, in Monte Carlo simulations I have found that -postfile- is faster than frames.

        Your example presents frames in too bad a light and the `if' approach in too good a light, because it creates a frame on every iteration while imposing the `if' condition on only one line of code. The applications I have in mind are writing my own (hopefully faster) programmes to do regressions, IV regressions, multivariate regressions, etc. In these applications the balance is different: frames are created rarely, but plenty of lines in the code need to impose the `if' condition.
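
        One way to probe this balance is to vary how many commands run per subsample creation. The following is a rough sketch adapted from the loop in #2; the local k and its value of 50 are assumptions chosen for illustration, and no timings are shown because they depend on the data size and the Stata flavor.

        Code:
        sysuse auto, clear
        timer clear
        local k 50                      // hypothetical number of commands per call

        forvalues i = 1/100 {
            * timer 1: one frame copy, then k unconditional commands
            timer on 1
            tempname sub
            frame put if foreign == 1, into(`sub')
            frame `sub' {
                forvalues j = 1/`k' {
                    summarize rep78, meanonly
                }
            }
            frame drop `sub'
            timer off 1

            * timer 2: no copy, but the if qualifier is evaluated k times
            timer on 2
            forvalues j = 1/`k' {
                summarize rep78 if foreign == 1, meanonly
            }
            timer off 2
        }

        timer list

        Raising or lowering k shifts the comparison between the fixed cost of the copy and the repeated cost of the qualifier.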

        Do you think there is a way to trick -postfile- into acting like the frames in your demonstration? I understand that -postfile- has a different purpose, namely Monte Carlo simulations. But maybe there is a way to -postfile- the current data, impose the restrictions and do the calculations, and then revert to the posted file while avoiding writing to disk?


        • #5
          I am running Stata MP 17 with two cores, daniel klein -- I moved to a rich university and they have spent some cash on Stata -- but -preserve/restore- is still slow as a turtle, as it has always been. So I do not know why Stata Corp are saying what they are saying in the documentation.

          Thank you for the other considerations that it is harder to keep track of what has been done when dealing with two datasets, and particularly setting the e(sample). This did not occur to me, I have to think this through more.


          • #6
            My bad, Stata Corp are not lying. Indeed, in Stata 17 MP preserve/restore is as fast as frames, so it is indeed using frames.

            When I run Maarten's code in #2, I get the following timings:

            Code:
            .  timer list
               1:      0.16 /     1000 =       0.0002
               2:      0.18 /     1000 =       0.0002
               3:      0.03 /     1000 =       0.0000


            • #7
              As an aside to this discussion, simply to highlight its existence, there is also -frame post-, the direct analogue of -postfile-, which lets you accumulate results directly into a frame. It can be an alternative here, though not if you need backwards compatibility with versions of Stata prior to 16.
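
              A minimal side-by-side sketch of the two, assuming a toy simulation that stores the mean of 50 standard normal draws per replication. The frame name sims and the variable names rep and mean are illustrative.

              Code:
              * -postfile-: results accumulate in a file on disk
              tempname h
              tempfile results
              postfile `h' rep double mean using `results'
              forvalues r = 1/100 {
                  drop _all
                  quietly set obs 50
                  generate double x = rnormal()
                  summarize x, meanonly
                  post `h' (`r') (r(mean))
              }
              postclose `h'

              * -frame post- (Stata 16+): results accumulate in a frame in memory
              frame create sims rep double mean
              forvalues r = 1/100 {
                  drop _all
                  quietly set obs 50
                  generate double x = rnormal()
                  summarize x, meanonly
                  frame post sims (`r') (r(mean))
              }
              frame sims: summarize mean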


              • #8
                Frames or not, preserve/restore copies quite a lot of data back and forth in addition to the 74 comparisons (one per observation in auto.dta) for the -if- qualifier. The other way has only the 74 comparisons and no copy. If there is an OS call for additional memory, that is expensive too. Preserve/restore isn't just changing a pointer like a COW filesystem. As for -postfile-, my only experience with it is writing regression results to a file, and the amount of data written was always very small compared to the underlying dataset.

                On the other hand, in the realm of "syntactic sugar" it would be nice if the -if- qualifier didn't have to be repeated for each command, but could be valid for a block enclosed in braces.
