  • Regarding preserve restore command

    Hello,

    I am currently facing three questions regarding the preserve and restore commands. I would be thankful for any clarifications on this.

    (1) My first question is regarding example 1 in the Stata manual entry preserve.pdf. What exactly is the need for preserve in this example? If the user does not need the data back for further analysis once she has the results, then when she closes her dataset, it will return to its original form unless she decides to save the new data.

    (2) On page 1, under options, the manual states "preserve instructs restore to restore the data now, but not to cancel the restoration of the data again at program conclusion. If preserve is not specified, the scheduled restoration at program conclusion is canceled." What does this sentence imply? Wouldn't we always use the preserve command first and then the restore command?

    (3) For my code, I am starting by cleaning the data. After I clean the data, I need to run different analyses that require the same clean data. I do not want to clean the data every time I work on a different research question, so I am using the preserve and restore commands in Stata. I am assuming the commands would go something like:

    ****************************************************************************************

    (clean the data)
    *preserve the clean data
    preserve
    (research question 1 analysis)
    *saving new research data from research question 1 analysis as a different file
    save
    *now I need to restore the clean data for my second research question
    restore
    *preserve again as I need to keep my original clean data
    preserve
    (research question 2 analysis)
    *saving new research data from research question 2 analysis as a different file
    save
    *now restore the clean data
    restore

    ****************************************************************************************

    Once I close my dataset after my final command "restore", I wanted to confirm that the data would go back to its original form (the one that was there before I cleaned the data).

    Thank you for your time!

  • #2
    preserve/restore lets you do whatever you want to the data, but any changes you have not saved are lost upon restore, at which point the original data comes back unaltered.

    I use it the way you describe all the time, especially when constructing a dataset.

    frames is another way to do what you want.
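
    A minimal sketch of the frames approach, assuming Stata 16 or later (where frames are available). The frame name rq1 is hypothetical, and the bundled auto dataset stands in for the cleaned data. The destructive step runs on a copy, so the data in the default frame is never touched and nothing needs to be restored:

    ```stata
    sysuse auto, clear                 // stand-in for the cleaned data

    * Copy the data in the default frame into a new frame (hypothetical name rq1)
    frame copy default rq1

    * Run the destructive analysis inside the copy only
    frame rq1 {
        collapse (mean) price, by(foreign)
        list
    }

    * Discard the working copy; the default frame still holds the full data
    frame drop rq1
    describe, short                    // still 74 observations
    ```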


    • #3
      (1) My first question is regarding example 1 in the Stata manual entry preserve.pdf. What exactly is the need for preserve in this example? If the user does not need the data back for further analysis once she has the results, then when she closes her dataset, it will return to its original form unless she decides to save the new data.
      The -preserve- command is necessary because the code that follows it destroys the original data in order to make a calculation, and this is in the context of a program. In a program, as opposed to a do-file, one is typically writing code that will solve some particular problem, and is intended for general use. By general use, I mean that you do not know who the users of this program will be, and you cannot know what they plan to do with the results your program creates. The user may well need the original data back after the program terminates--and you can't know if this will be the case or not when you write the program. So the general framework for programs is that you avoid "side-effects." That is, when the user uses your program, they can assume that any results calculated will be returned in -r()- or -e()- (or occasionally elsewhere), or in new variables whose names have been specified in the program-call command, and the data set will not have changed from its pre-program state.

      If this example were in a do-file, then the -preserve- would indeed be unneeded, because you could simply choose in your do-file to exit without saving the modified version of the data. But in a -program- you are programming for some future user whose circumstances and needs you cannot anticipate.
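
      To make the "no side-effects" idea concrete, here is a minimal sketch rather than the manual's own example. The program name grpmeanmed and the stored result r(median) are hypothetical. The program destroys the data with -collapse-, returns its result in -r()-, and relies on the automatic restoration that happens when a program that has issued -preserve- ends:

      ```stata
      * Hypothetical r-class program: median of the group means of a variable
      capture program drop grpmeanmed
      program define grpmeanmed, rclass
          version 14
          syntax varname(numeric), BY(varname)
          preserve                                   // snapshot the caller's data
          collapse (mean) m = `varlist', by(`by')    // destroys the data in memory
          quietly summarize m, detail
          return scalar median = r(p50)
          // no explicit -restore- needed: Stata restores the
          // preserved data automatically when the program ends
      end

      sysuse auto, clear
      grpmeanmed price, by(rep78)
      display r(median)
      describe, short        // the caller's data is intact: 74 observations, 12 variables
      ```

      The caller gets the result in r(median) and never sees the collapsed data.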

      (2) On page 1, under options, the manual states "preserve instructs restore to restore the data now, but not to cancel the restoration of the data again at program conclusion. If preserve is not specified, the scheduled restoration at program conclusion is canceled." What does this sentence imply? Wouldn't we always use the preserve command first and then the restore command?
      I think what the manual says here, although accurate as a description of what Stata does, fails to make clear the purpose of this. When you -restore- your data, the copy of the original data that Stata made with the -preserve- command is, by default, abandoned. Suppose you want to do something like this:

      Code:
      use my_data, clear
      
      //  DO SOME CALCULATIONS THAT ADD SOME NEW VARIABLES
      //  TO THE DATA
      
      preserve
      // PERFORM A CALCULATION THAT DESTROYS THE AUGMENTED DATA
      restore
      
      preserve
      // PERFORM ANOTHER CALCULATION, STARTING FROM THE PREVIOUSLY AUGMENTED DATA,
      // THAT, AGAIN, DESTROYS THE DATA
      restore
      
      preserve
      // PERFORM YET ANOTHER CALCULATION, STARTING FROM THE PREVIOUSLY AUGMENTED DATA,
      // THAT, AGAIN, DESTROYS THE DATA
      ...
      Notice that you have had to -preserve- the data three times here. Each of those requires a disk write operation*, which, if your data set is large, can be very time-consuming. Imagine, in fact, that instead of a sequence of three calculations, it was a calculation in a loop being iterated maybe thousands of times. You will be thrashing the disk till the cows come home! Wouldn't it make more sense to write the data to the disk just once and, at each -restore-, just read in that same first copy? That's what -restore, preserve- allows you to do. Rather than discarding the -preserve-d data, -restore, preserve- brings that data back into memory but also retains the -preserve-d copy for future use, or for automatic restoration at the end of the program or do-file.

      (3) For my code, I am starting by cleaning the data. After I clean the data, I need to run different analyses that require the same clean data. I do not want to clean the data every time I work on a different research question, so I am using the preserve and restore commands in Stata. I am assuming the commands would go something like:...
      That code will perform as you expect. But, in light of the explanation given for (2), a more efficient version would be:
      Code:
      (clean the data)
      *preserve the clean data
      preserve
      (research question 1 analysis)
      *saving new research data from research question 1 analysis as a different file
      save
      *now I need to restore the clean data for my second research question
      restore, preserve
      *preserve again as I need to keep my original clean data
      // preserve THIS COMMAND DELETED
      (research question 2 analysis)
      *saving new research data from research question 2 analysis as a different file
      save
      *now restore the clean data
      restore
      *In the versions of Stata since -frame-s were introduced, the -preserve-d data may be written to a frame in memory rather than to disk. This will, in general, be faster than a disk write, although there may be considerable overhead negotiating with the operating system to get the required memory allocated. So the use of -restore, preserve- to reduce the number of -preserve- operations applied to the exact same data is still a good idea.

      Edit: Crossed with #2.


      • #4
        In reply to Clyde Schechter (#3):

        Thank you for your detailed response, Clyde! It makes sense. I had one further question regarding point (2). I understand that the sample code which preserves the data three times is inefficient. Can you give an example code that uses the restore-preserve command on the same code?


        • #5
          Sure. Here's how the first block of code shown in #3 would be done using -restore, preserve-:

          Code:
          use my_data, clear
          
          //  DO SOME CALCULATIONS THAT ADD SOME NEW VARIABLES
          //  TO THE DATA
          
          preserve
          // PERFORM A CALCULATION THAT DESTROYS THE AUGMENTED DATA
          restore, preserve
          
          // PERFORM ANOTHER CALCULATION, STARTING FROM THE PREVIOUSLY AUGMENTED DATA,
          // THAT, AGAIN, DESTROYS THE DATA
          restore, preserve
          
          // PERFORM YET ANOTHER CALCULATION, STARTING FROM THE PREVIOUSLY AUGMENTED DATA,
          // THAT, AGAIN, DESTROYS THE DATA
          ...
          With this code, the data is only written to disk once, with the -preserve- command, but is reused with each -restore, preserve- command.
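
          The same pattern inside a loop, as a minimal sketch. It assumes the bundled auto dataset as a stand-in for the augmented data, and the variable price_k and the per-group count are purely illustrative. There is one -preserve- before the loop, and each pass ends with -restore, preserve- so the next pass starts from the full data:

          ```stata
          sysuse auto, clear
          generate double price_k = price/1000      // the "augmented" data

          preserve                                   // the only snapshot taken
          forvalues r = 1/5 {
              quietly keep if rep78 == `r'           // destructive step
              quietly count
              display "rep78 = `r': " r(N) " cars"
              restore, preserve                      // bring the snapshot back and keep it
          }
          restore                                    // final restore releases the snapshot
          describe, short                            // full data again: 74 observations
          ```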


          • #6
            In reply to Clyde Schechter (#5):

            Thank you so much for your time on this. I understand this now!
