  • How to "rollback" Stata commands?

    Dear Colleagues,

    When using large data sets, I frequently find myself in the following situation: I open the data set, add or modify some variables, realize I made a mistake, and want to revert to the original data set without wasting time opening the entire data set all over again. Sometimes I can identify which variables have been created and drop them, but this is tedious, error-prone, and doesn't always get back to the original condition of the data set. preserve and restore, of course, are designed for this purpose, but they are slower than re-opening the data set.

    Is there any other way in Stata to reverse the effects of commands that change the data set? The only way I could think of was to use cmdlog and then parse the command log (is there another way to get at previously executed commands?), go back to the last use, and then start dropping variables that had been generated and perhaps in some cases reverse the effects of a replace. Of course, recovering dropped variables is probably out of the question. Does anybody think that such a command would be useful? To be really useful, StataCorp would probably have to implement it, but would a limited version be useful in the meantime?

    Regards,
    Joe

  • #2
    How about entering your commands in a do-file? That way, you can track any changes you have made and go back to the specific line in the do-file where you think you made a mistake.
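    A minimal sketch of that workflow (filenames hypothetical): keep every step in one do-file, so a wrong step means editing the file and rerunning it from the top against the freshly opened data.

    ```stata
    * analysis.do -- hypothetical example
    use mydata, clear              // always start from the saved data set
    generate lnwage = log(wage)
    replace lnwage = . if wage <= 0
    * If a later step proves wrong, fix the line above and
    * rerun the whole file: do analysis
    ```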

    • #3
      Andrew,

      Thanks for the suggestion. I do most of my work in do-files, partly for that reason, and, yes, it is possible to fix the mistake and execute the program starting at that point. That works OK when initially debugging a program, but when making changes to the program (e.g., adding/subtracting features, fixing logic errors, etc.) it can start to get complicated. Accordingly, I prefer to start with a blank slate (i.e., the newly opened data set) rather than trying to figure out where to start executing the code so as to encompass all changes while also avoiding "variable already exists" errors and other discontinuities.

      Regards,
      Joe

      • #4
        Seems pretty much like a no-go to me. You're already following best practices and know about preserve. One thing that *might* help is that whenever you create a new set of variables, have a statement like "capture drop newvar*" before creating them. That way, you can more conveniently backtrack.
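        A sketch of the guard described above (variable names hypothetical): each block that builds new variables starts by dropping them, so the block can be rerun after a mistake without "variable already defined" errors.

        ```stata
        capture drop newvar*              // silently does nothing if absent
        generate newvar1 = price / weight
        generate newvar2 = log(price)
        ```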

        One thing that comes to mind to make this less painful is optimizing hardware. Have you maxed out your available RAM, and are you using a solid-state drive? Also, there was a discussion a while back about RAM disks, which, if you have enough RAM, can speed things up.

        • #5
          Joe, this doesn't answer your question, but it is probably conceptually very close to what you want:
          http://www.stata.com/meeting/boston1...14-fiedler.pdf
          see the slides on the notebook interface.

          In general, however, if you type clear all, Stata will clear everything, and this is irreversible. Using preserve/restore can minimize damage to the dataset, but only if you use them. They will not insure against other effects of the commands you type (effects on matrices, Mata, locals, program definitions, etc.).

          Stata follows the convention that if a command contains a syntax error, the command is aborted and the data are not affected. User-written commands are advised to follow this convention as well.
          However, if a command contains no syntax error, it is executed, and this is irreversible unless you can figure out the reversing command (e.g., drop for generate).
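          For a command with an obvious inverse, manual reversal might look like this sketch (variable names hypothetical); note that a replace is only reversible if the old values were saved first.

          ```stata
          generate lnprice = log(price)      // forward step
          drop lnprice                       // its reverse

          generate double oldprice = price   // back up before a replace
          replace price = price/1000
          replace price = oldprice           // undo the replace
          drop oldprice
          ```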

          In the world of graphics editors, the UNDO feature is implemented in two ways:
          1) raster editors keep a full copy of the canvas in memory and revert to it if necessary;
          2) vector editors record all the edits; to undo, they remove the last edit, clear the canvas, and replay from the beginning.
          The first is the equivalent of preserve/restore.
          The second is the equivalent of a do-file.

          I'd strongly advocate working with do-files. The typical critique is that work slows down if one needs to rerun the file from the beginning on a large dataset; the answer is to trim the dataset until the program is written, then run it on the full dataset. And if the file becomes too long, split it into several individual subfiles that can run independently.
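          The trimming step might look like this sketch (filenames hypothetical): develop against a small random sample, then point the finished do-file at the full data.

          ```stata
          use full_data, clear
          set seed 12345          // make the sample reproducible
          sample 5                // keep a 5% random sample of observations
          save dev_sample, replace
          * ...write and debug the do-file against dev_sample,
          * then rerun it with full_data
          ```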

          Hope this helps. Best, Sergiy Radyakin

          • #6
            I have found when working with "large" datasets that it can be helpful to do my initial exploratory work on a subset of the data. The code I develop on the subset becomes a good starting point when I turn my hand to working on the full dataset. Developing models on a reasonably well chosen subset often yields good starting points for the full dataset. Of course all this assumes that the data structure is not so complicated that it becomes difficult to select a subset with a stochastic structure similar to the full data.

            • #7
              Ben,

              Thanks for the suggestion of "capture drop"; that could certainly be useful in some situations. We generally use network drives, so there's not much I can do about the speed.

              Sergiy,

              I'm not sure I'm ready to tackle Python, but that looks like an interesting possibility; I had forgotten about that part of James' presentation.

              Sergiy & William,

              Thanks for the suggestion regarding testing on a subset. That would certainly solve a lot of the speed problems, provided one was foresightful enough to do that ahead of time.

              Regards,
              Joe

              • #8
                Ah. Network drives. Is it *possible* to make a temporary copy of the dataset(s) on your local hard drive? Obviously, for keeping data safe both from destruction and from system compromise/permissions issues, network drives are better. But even with Gigabit Ethernet, I've seen load times for large datasets cut probably a hundredfold by working off the local hard drive, even a regular spinning platter rather than an SSD. You might need special permission from your SysAdmin to do so, but running things locally is dramatically faster with big data.
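                A sketch of that approach (paths hypothetical):

                ```stata
                copy "\\server\share\big.dta" "C:\temp\big.dta", replace
                use "C:\temp\big.dta", clear
                * ...work locally; copy results back to the network drive when done
                ```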

                • #9
                  It's been a while since you asked this, and you probably already have a better answer or already tried this, but have you tried using snapshot?
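                  For reference, a minimal snapshot round trip (using Stata's shipped auto data):

                  ```stata
                  sysuse auto, clear
                  snapshot save, label("pristine")   // stored as snapshot 1
                  replace price = .                  // a change to regret
                  snapshot restore 1                 // data as of the snapshot
                  snapshot list _all
                  snapshot erase _all                // free the memory when done
                  ```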

                  • #10
                    Carlos,

                    Thanks for that suggestion; I was revisiting this issue and just saw your comment. I was not familiar with snapshot, but it looks like a good alternative to preserve/restore for interactive use. It turns out, my original post notwithstanding, that preserve/restore (and probably snapshot) are good alternatives in our situation because their temporary files go to a local disk that is much faster than the network drive we use for normal storage. Moreover, restore is somewhat faster than use because less error-checking is required.

                    Regards,
                    Joe

                    • #11
                      Joe,

                      I'm glad snapshot is a good option for you. I like preserve and restore when I'm working with code that I'm sure already works. But I prefer snapshot if I'll be trying stuff using the command line. I usually compress the data after loading it to memory, create a snapshot of it before I make any changes, and then create more snapshots after applying changes that take Stata a while to process. It's really like working with Mac OS's Time Machine, or with file versions in Windows or Dropbox.
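                      The checkpoint workflow described above, as a sketch (filenames hypothetical):

                      ```stata
                      use big_data, clear
                      compress                               // shrink storage types
                      snapshot save, label("as loaded")      // checkpoint 1
                      * ...an expensive merge/reshape here...
                      snapshot save, label("after reshape")  // checkpoint 2
                      * experiment freely; to roll back to the last checkpoint:
                      snapshot restore 2
                      ```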

                      Best wishes,
                      Carlos
