Good afternoon,
From looking at code written by StataCorp and experienced Stata programmers, I see that the usual style is to restrict calculations to the subsample of interest by creating a touse variable, and then every statement in the programme ends with " if `touse' ".
To me this looks awkward and unnatural, and my intuition suggests that such code should be very slow to execute. To me the natural way of restricting to the subsample of interest is to start with
-preserve- (keep a copy of the original data)
-keep- with the if/in restrictions (drop what we do not need for our calculations)
then do all the calculations without any conditions: no ifs, no ins.
-restore- (put the data in the state we found it).
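For concreteness, here is a minimal sketch of the two styles I am comparing. The program names mystat_touse and mystat_keep are made up for illustration, and -summarize- merely stands in for whatever calculations the real programme would do.

```stata
* Style 1 (the usual one): mark the sample once, then condition
* every command on `touse'
program define mystat_touse
    syntax varname [if] [in]
    marksample touse                  // touse = 1 for observations to use
    quietly summarize `varlist' if `touse'
    display "mean = " r(mean)
end

* Style 2 (my plan): preserve, keep the subsample, compute
* without any conditions, restore
program define mystat_keep
    syntax varname [if] [in]
    marksample touse
    preserve                          // keep a copy of the original data
    quietly keep if `touse'           // drop what we do not need
    quietly summarize `varlist'       // no ifs, no ins
    display "mean = " r(mean)
    restore                           // put the data back as we found it
end
```

With the auto data loaded (-sysuse auto, clear-), both -mystat_touse price if foreign- and -mystat_keep price if foreign- should report the same mean; only the way the subsample is handled differs.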
Apart from it being awkward to add " if `touse' " to every statement, I thought that my plan would execute faster because
1. We scan the data only once for the relevant subsample, keep it, and we are done. With the current style, by contrast, every line of the programme has to scan the full data again, which sounds like a lot of work for Stata.
2. Once we drop what we do not need at the beginning of the programme, we have a smaller dataset to operate on, which should make things easier and faster for Stata.
And yet I am empirically wrong. The preserve/restore construct is very slow, and the current awkward style beats it in speed.
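A comparison along these lines can be timed with -timer-. This is only a sketch: the dataset, the -expand- factor and the number of repetitions are arbitrary choices for illustration, not the exact setup I ran.

```stata
sysuse auto, clear
expand 10000                 // arbitrary scale-up so that timings are visible

timer clear

* Style 1: condition each command on the subsample
timer on 1
forvalues i = 1/20 {
    quietly summarize price if foreign
}
timer off 1

* Style 2: preserve, keep the subsample, compute, restore
timer on 2
forvalues i = 1/20 {
    preserve
    quietly keep if foreign
    quietly summarize price
    restore
}
timer off 2

timer list                   // timer 1 = if-condition style, timer 2 = preserve/restore style
```

Each style is repeated inside a loop so that a single fast call is not lost in timer resolution.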
Why is that?
My second and related question is: how is it possible that the -postfile- facility is so fast when the preserve/restore facility is so slow?