Regarding preserve restore command

Himani Srihan

Join Date: Apr 2020

Posts: 51
#1

19 Nov 2024, 13:57

Hello,

I have three questions regarding the preserve and restore commands and would be thankful for any clarification.

(1) My first question is regarding example 1 in the Stata manual entry preserve.pdf. What exactly is the need for preserve in this example? If the user does not need the data back for further analysis, then once she has the results and closes her dataset, the dataset will return to its original form unless she decides to save the new data.

(2) On page (1), under options, the manual states "preserve instructs restore to restore the data now, but not to cancel the restoration of the data again at program conclusion. If preserve is not specified, the scheduled restoration at program conclusion is canceled." What does this sentence imply? Would we not always use the preserve command first and then the restore command?

(3) For my code, I am starting by cleaning the data. After I clean the data, I need to run different analyses that require the same clean data. I do not want to clean the data every time I work on a different research question, so I am using the preserve and restore commands in Stata. I am assuming the commands would go something like:

*****************************************************************************************

(clean the data)
*preserve the clean data
preserve
(research question 1 analysis)
*saving new research data from research question 1 analysis as a different file
save
*now I need to restore the clean data for my second research question
restore
*preserve again as I need to keep my original clean data
preserve
(research question 2 analysis)
*saving new research data from research question 2 analysis as a different file
save
*now restore the clean data
restore

*****************************************************************************************

Once I close my dataset after my final command "restore", I wanted to confirm that the data would go back to its original form (the one that was there before I cleaned the data).

Thank you for your time!
George Ford

Join Date: Aug 2014

Posts: 3120
#2

19 Nov 2024, 14:33

preserve/restore lets you do whatever you want to the data, but any changes you have not saved are lost upon restore, at which point the original data come back unaltered.

I use it the way you describe all the time, especially when constructing a dataset.

Frames are another way to do what you want.
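
For illustration, here is a minimal sketch of the frames approach, assuming Stata 16 or later. The frame name rq1 and the use of the shipped auto dataset as a stand-in for the cleaned data are assumptions made for the example, not part of the original question.

```stata
* Stand-in for the cleaned data (assumption: any dataset in memory would do)
sysuse auto, clear
local n_clean = _N

* Copy the current frame into a new frame; "rq1" is a hypothetical name
capture frame drop rq1
frame copy `c(frame)' rq1

* Destructive work happens only inside frame rq1
frame rq1 {
    collapse (mean) price, by(foreign)
    list
}

* The clean data in the current frame are untouched
assert _N == `n_clean'

* Discard the working copy once it is no longer needed
frame drop rq1
```

With this pattern there is nothing to restore: the clean data never leave the current frame, and each research question can work in its own copy.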
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#3

19 Nov 2024, 14:52

(1) My first question is regarding example 1 in the Stata manual entry preserve.pdf. What exactly is the need for preserve in this example? If the user does not need the data back for further analysis, then once she has the results and closes her dataset, the dataset will return to its original form unless she decides to save the new data.

The -preserve- command is necessary because the code that follows it destroys the original data in order to make a calculation, and this is in the context of a program. In a program, as opposed to a do-file, one is typically writing code that will solve some particular problem, and is intended for general use. By general use, I mean that you do not know who the users of this program will be, and you cannot know what they plan to do with the results your program creates. The user may well need the original data back after the program terminates--and you can't know if this will be the case or not when you write the program. So the general framework for programs is that you avoid "side-effects." That is, when the user uses your program, they can assume that any results calculated will be returned in -r()- or -e()- (or occasionally elsewhere), or in new variables whose names have been specified in the program-call command, and the data set will not have changed from its pre-program state.

If this example were in a do-file, then the -preserve- would indeed be unneeded because you could simply choose in your do-file to exit without saving the modified version of the data. But in a -program- you are programming for some future user whose circumstances and needs you cannot anticipate.
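
To make the "no side effects" idea concrete, here is a small sketch. It is not the manual's example: the program name meanif_demo and its behavior are hypothetical, and the shipped auto dataset is used only for demonstration.

```stata
* Hypothetical program: mean of a variable over the observations selected by -if-
capture program drop meanif_demo
program define meanif_demo, rclass
    syntax varname [if]
    marksample touse              // flag the observations to use
    preserve                      // snapshot the caller's data
    quietly keep if `touse'       // destructive: drops observations
    quietly summarize `varlist'
    return scalar mean = r(mean)  // results go back to the caller in r()
    return scalar N    = r(N)
    // No explicit -restore-: Stata restores the data when the program ends
end

sysuse auto, clear
local n_before = _N
meanif_demo price if foreign == 1
display "mean = " r(mean) ", N = " r(N)

* The caller's data are unchanged even though the program dropped observations
assert _N == `n_before'
```

The caller gets the answer in -r()- and never sees that observations were dropped along the way.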

(2) On page (1), under options, the manual states "preserve instructs restore to restore the data now, but not to cancel the restoration of the data again at program conclusion. If preserve is not specified, the scheduled restoration at program conclusion is canceled." What does this sentence imply? Would we not always use the preserve command first and then the restore command?

I think what the manual says here, although accurate as a description of what Stata does, fails to make clear the purpose of this. When you -restore- your data, the copy of the original data that Stata made with the -preserve- command is, by default, abandoned. Suppose you want to do something like this:

Code:

use my_data, clear

// DO SOME CALCULATIONS THAT ADD SOME NEW VARIABLES
// TO THE DATA

preserve
// PERFORM A CALCULATION THAT DESTROYS THE AUGMENTED DATA
restore

preserve
// PERFORM ANOTHER CALCULATION, STARTING FROM THE PREVIOUSLY AUGMENTED DATA,
// THAT, AGAIN, DESTROYS THE DATA
restore

preserve
// PERFORM YET ANOTHER CALCULATION, STARTING FROM THE PREVIOUSLY AUGMENTED DATA,
// THAT, AGAIN, DESTROYS THE DATA
...

Notice that you have had to -preserve- the data three times here. Each of those requires a disk write operation*, which, if your data set is large, can be very time-consuming. Imagine, in fact, that instead of a sequence of three calculations, it was a calculation in a loop being iterated maybe thousands of times. You will be thrashing the disk till the cows come home! Wouldn't it make more sense to write the data to the disk just once and, at each -restore-, just read in that same first copy? That's what -restore, preserve- allows you to do. Rather than losing the -preserve-d data, -restore, preserve- brings that data back into memory, but also retains the -preserve-d copy for future use, or for automatic restoration at the end.

(3) For my code, I am starting by cleaning the data. After I clean the data, I need to run different analyses that require the same clean data. I do not want to clean the data every time I work on a different research question, so I am using the preserve and restore commands in Stata. I am assuming the commands would go something like:...

That code will perform as you expect. But, in light of the explanation given for (2), a more efficient version would be:

Code:

(clean the data)
*preserve the clean data
preserve
(research question 1 analysis)
*saving new research data from research question 1 analysis as a different file
save
*now I need to restore the clean data for my second research question
restore, preserve
*preserve again as I need to keep my original clean data
// preserve THIS COMMAND DELETED
(research question 2 analysis)
*saving new research data from research question 2 analysis as a different file
save
*now restore the clean data
restore
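
A concrete version of this outline might look like the sketch below. The auto dataset stands in for the cleaned data, the two "analyses" are arbitrary, and the filenames rq1_results and rq2_results are hypothetical. Note that each -save- is given its own filename here; as far as I know, a bare -save- writes to the file the data were last read from or saved to, which is probably not what is wanted for a separate results file.

```stata
* Stand-in for "(clean the data)"
sysuse auto, clear
local n_clean = _N

* Keep one copy of the clean data
preserve

* Research question 1 (arbitrary example): group means
collapse (mean) price mpg, by(foreign)
save rq1_results, replace        // hypothetical filename

* Bring back the clean data, but keep the preserved copy for later
restore, preserve

* Research question 2 (arbitrary example): a subsample
keep if rep78 >= 4 & !missing(rep78)
save rq2_results, replace        // hypothetical filename

* Final restore: clean data return and the preserved copy is released
restore
assert _N == `n_clean'
```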

*In the versions of Stata since -frame-s were introduced, the -preserve-d data may be written to a frame in memory rather than to disk. This will, in general, be faster than a disk write, although there may be considerable overhead negotiating with the operating system to get the required memory allocated. So the use of -restore, preserve- to reduce the number of -preserve- operations applied to the exact same data is still a good idea.

Edit: Crossed with #2.
Himani Srihan

Join Date: Apr 2020

Posts: 51
#4

20 Nov 2024, 07:02

Thank you for your detailed response, Clyde! It makes sense. I have one further question regarding point (2). I understand that the sample code, which preserves the data three times, is inefficient. Could you give example code that uses -restore, preserve- for the same task?

Clyde Schechter

Join Date: Apr 2014
Posts: 29956

20 Nov 2024, 09:26

Sure. Here's how the first block of code shown in #3 would be done using -restore, preserve-:

Code:

use my_data, clear

//  DO SOME CALCULATIONS THAT ADD SOME NEW VARIABLES
//  TO THE DATA

preserve
// PERFORM A CALCULATION THAT DESTROYS THE AUGMENTED DATA
restore, preserve

// PERFORM ANOTHER CALCULATION, STARTING FROM THE PREVIOUSLY AUGMENTED DATA,
// THAT, AGAIN, DESTROYS THE DATA
restore, preserve

// PERFORM YET ANOTHER CALCULATION, STARTING FROM THE PREVIOUSLY AUGMENTED DATA,
// THAT, AGAIN, DESTROYS THE DATA
...

With this code, the data is only written to disk once, with the -preserve- command, but is reused with each -restore, preserve- command.
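
The same pattern carries over to the loop case mentioned in #3. The following is only a sketch: the auto dataset and the per-category summary are stand-ins chosen for illustration.

```stata
sysuse auto, clear
local n_full = _N

* One -preserve- before the loop, not one per iteration
preserve

forvalues r = 1/5 {
    * Destructive step: keep a single repair-record category
    quietly keep if rep78 == `r'
    quietly summarize price
    display "rep78 = `r': N = " r(N) ", mean price = " r(mean)

    * Reload the full data and keep the preserved copy for the next pass
    restore, preserve
}

* Data in memory are already the full data; just release the preserved copy
restore, not
assert _N == `n_full'
```

Using -restore, not- at the end is a design choice: since the last -restore, preserve- has already reloaded the data, there is no need to read the copy back in again.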


Himani Srihan

Join Date: Apr 2020
Posts: 51

21 Nov 2024, 15:07

Thank you so much for your time on this. I understand this now!
