  • Creating and saving bootstrap samples for

    I have an estimation procedure which requires bootstrapped standard errors. My dataset is 2.7M observations, ~3500 clusters, and the estimation procedure takes a long time per sample (the supercomputer takes roughly 15hr to bootstrap 20 replications using Stata's bootstrap). To make better use of my time and supercomputing access, I want to essentially do the bootstrap command in pieces.

    First, I want to draw 500 (or more) samples with replacement in a replicable manner (setting seed, etc.), and save each one. Then, I'll run an "array job" of my estimation procedure on each saved sample. Finally, I'll compute standard errors as Stata's bootstrap does.

    On the 3rd sampling of my dataset, I receive this error message:

    Code:
    I/O error writing .dta file
        Usually such I/O errors are caused by the disk or file system being full.
    r(693);

    I suspect this happens because Stata creates some sort of temporary file for each sample drawn, but I am not sure how to proceed. What is odd is that I have successfully used preserve, restore, and save in this looping manner on much larger datasets, but never before with the bsample command.

    Though my own dataset fails on the 3rd sample, you can replicate the error on the first sample using the code below (at least on my machine). Since this is a "large" dataset problem, I did not use dataex.

    Code:
    clear all
    sysuse auto
    
    expand 37000
    
    global test ""
    
    set seed 1234 // Statalist-seed
    forvalues i = 1(1)5 {
        // preserve dataset
        preserve
        
        // draw a random number to advance the seedstate
        gen random = runiform()
        
        //bootstrap sample it
        bsample, cluster(make) idcluster(make2)
        
        // attach seedstate
        gen seedstate = c(seed)
        
        // save repcount
        gen dataset_id = `i'
        
        // drop random number
        drop random
        
        // save dataset
        save "$test/sample_`i'", replace
        
        // save seedstate
        keep seedstate
        duplicates drop
        save "$test/seedstate_`i'", replace
        
        // restore dataset
        restore
        
    }
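
    The last step described above (computing standard errors from the separately estimated samples) can be sketched as follows. This is a minimal illustration, not the poster's actual procedure: it uses the auto data, a stand-in regression of price on weight, and hypothetical names (b_weight, reps), and it runs the estimation inside the loop rather than as an array job. It relies on the fact that the bootstrap standard error is the sample standard deviation of the replicate estimates.

    ```stata
    clear all
    sysuse auto
    set seed 1234

    * Collect one estimate per bootstrap sample (names are hypothetical)
    tempname h
    tempfile reps
    postfile `h' rep b_weight using `reps'

    forvalues i = 1/50 {
        preserve
        bsample, cluster(make) idcluster(make2)
        quietly regress price weight      // stand-in for the real estimator
        post `h' (`i') (_b[weight])
        restore
    }
    postclose `h'

    * Bootstrap SE = standard deviation of the replicate estimates
    use `reps', clear
    quietly summarize b_weight
    display "bootstrap SE of _b[weight]: " r(sd)
    ```

    In an array-job setup, each job would instead write its own estimate to a small file, and those files would be appended before the final summarize.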

  • #2
    I cannot replicate your problem. The code you show ran with no error messages on my system, which is neither unusually fast nor especially capacious,
    Code:
    Stata/MP 18.0 for Windows (64-bit x86-64)
    Revision 13 Jul 2023
    Copyright 1985-2023 StataCorp LLC
    
    Total physical memory:       32.00 GB
    Available physical memory:   22.50 GB
    
    Stata license: Single-user 4-core  perpetual
    and successfully saved all of the files.

    I wonder if the problem you are encountering is not well described by the error message. Although it suggests considering a full disk or file system, it is actually a non-specific write error message. It may be that you are creating these files to be saved faster than the operating system can digest them and pass them on to the disk. If the OS's file buffers are full when another request to write to disk arrives, the OS will refuse and pass an error back to Stata, which Stata reports with r(693). So a possibility is that you are asking the OS to write file n+1 when it is still trying to write file n, and it is choking. This kind of situation is particularly common if you are saving your files to a network drive. You can slow down this process by inserting a -sleep- command before your -save- instruction: this usually requires some experimentation to find out how long a sleep period is needed to relieve the write bottleneck.
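
    A minimal sketch of that idea, assuming a 2-second pause and up to five attempts (both numbers are arbitrary and would need tuning), with a hypothetical file name sample_1 in the current directory:

    ```stata
    sysuse auto, clear

    * Pause before each save attempt, and retry if the write fails
    local saved 0
    forvalues attempt = 1/5 {
        sleep 2000                          // sleep takes milliseconds
        capture save "sample_1", replace
        if _rc == 0 {
            local saved 1
            continue, break
        }
    }
    if !`saved' {
        display as error "save still failing after 5 attempts"
        exit 693
    }
    ```

    The capture prefix keeps a failed write from stopping the loop, so a transient bottleneck costs a few seconds rather than the whole run.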

    • #3
      Hi there
      The problem is the line

      gen seedstate = c(seed)

      c(seed) is stored as a long string of length 5011, so you are asking Stata to store that string in every one of the roughly 2.7 million observations of each dataset.
      That is too much.

      Perhaps you may want to store it as a note in the dataset instead:

      Code:
      clear all
      sysuse auto
      
      expand 5000 
      
      global test "."
      
      set seed 1234 // Statalist-seed
      forvalues i = 1(1)5 {
          // preserve dataset
          preserve
          
          // draw a random number to advance the seedstate
          gen random = runiform()
          
          //bootstrap sample it
          bsample, cluster(make) idcluster(make2)
          
          // attach seedstate
          note : `c(seed)'
          
          // save repcount
          gen dataset_id = `i'
          
          // drop random number
          drop random
          
          // save dataset
          save "$test/sample_`i'", replace
          
        
          // restore dataset
          restore
          
      }
      Since you are initializing the Seed, I don't think you need to save each individual dataset "state".
      F

      • #4
        Since you are initializing the Seed, I don't think you need to save each individual dataset "state".
        Well, if everything goes well, it is unnecessary. But bootstrap samples sometimes prove to be unanalyzable, or produce anomalous results that require investigation. It is useful to be able to reproduce those particular data sets without having to re-run the entire bootstrapping process from the beginning. So storing the random number generator state along the way is a good practice.
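
        A minimal sketch of how a stored state can be used to redraw one particular sample, assuming Stata 14 or later, where c(rngstate) and set rngstate are available. The macro name state_before is hypothetical. The state has to be recorded before bsample is called, because a state recorded afterwards reproduces the next draw, not the one just made.

        ```stata
        clear all
        sysuse auto
        set seed 1234

        * Record the RNG state BEFORE drawing the sample
        local state_before `c(rngstate)'
        bsample, cluster(make) idcluster(make2)
        tempfile first
        save `first'

        * Later: reload the original data, restore the state, and redraw
        sysuse auto, clear
        set rngstate `state_before'
        bsample, cluster(make) idcluster(make2)

        * cf exits with an error if the two samples differ
        cf _all using `first'
        ```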

        That said, since the data sets themselves are being saved, I don't see how saving the seed on top of that is helpful. And I agree that if you are going to save the seed and the data set, doing it as a note in the data set makes more sense than as a variable. Even so, strL's do a pretty good job of conserving memory. Saving a 5,000 character strL in each observation of the data set does not expand the size of the data set by 5000*_N:
        Code:
        . clear*
        
        . sysuse auto
        (1978 automobile data)
        
        .
        . memory
        
        Memory usage
                                                 Used                Allocated
        ----------------------------------------------------------------------
        Data                                    3,182               67,108,864
        strLs                                       0                        0
        ----------------------------------------------------------------------
        Data & strLs                            3,182               67,108,864
        
        ----------------------------------------------------------------------
        Data & strLs                            3,182               67,108,864
        Variable names, %fmts, ...              4,370                   71,230
        Overhead                            1,081,344                1,082,136
        
        Stata matrices                              0                        0
        ado-files                               8,873                    8,873
        Stored results                              0                        0
        
        Mata matrices                               0                        0
        Mata functions                              0                        0
        
        set maxvar usage                    5,281,738                5,281,738
        
        Other                                   4,884                    4,884
        ----------------------------------------------------------------------
        Total                               6,378,879               73,557,725
        
        .
        . gen seed = "`c(seed)'" in 1
        (73 missing values generated)
        
        . memory
        
        Memory usage
                                                 Used                Allocated
        ----------------------------------------------------------------------
        Data                                    3,774               67,108,864
        strLs                                   5,092                    5,092
        ----------------------------------------------------------------------
        Data & strLs                            8,866               67,113,956
        
        ----------------------------------------------------------------------
        Data & strLs                            8,866               67,113,956
        Variable names, %fmts, ...              4,711                   71,230
        Overhead                            1,081,344                1,082,136
        
        Stata matrices                              0                        0
        ado-files                               8,873                    8,873
        Stored results                              0                        0
        
        Mata matrices                               0                        0
        Mata functions                              0                        0
        
        set maxvar usage                    5,281,738                5,281,738
        
        Other                                   4,884                    4,884
        ----------------------------------------------------------------------
        Total                               6,384,904               73,562,817
        
        .
        . replace seed = "`c(seed)'"
        (73 real changes made)
        
        . memory
        
        Memory usage
                                                 Used                Allocated
        ----------------------------------------------------------------------
        Data                                    3,774               67,108,864
        strLs                                  11,912                   11,912
        ----------------------------------------------------------------------
        Data & strLs                           15,686               67,120,776
        
        ----------------------------------------------------------------------
        Data & strLs                           15,686               67,120,776
        Variable names, %fmts, ...              4,711                   71,230
        Overhead                            1,081,344                1,082,136
        
        Stata matrices                              0                        0
        ado-files                               8,873                    8,873
        Stored results                              0                        0
        
        Mata matrices                               0                        0
        Mata functions                              0                        0
        
        set maxvar usage                    5,281,738                5,281,738
        
        Other                                   4,884                    4,884
        ----------------------------------------------------------------------
        Total                               6,391,724               73,569,637
        Adding 73 copies of the 5,011 byte seed to the data set only required an additional 6,820 bytes of memory. (I think, but do not know, that in fact only a single copy of the seed itself is stored and it is handled by storing pointers to that in the variable values.)

        • #5
          Clyde Schechter, thank you for the sleep suggestion. I think you are likely correct, as I am currently writing to a OneDrive folder. I ran into the same error with different code that had not produced the error before, so I am trying this suggestion there to test whether it works.

          FernandoRios, I wasn't aware of the note function and will definitely make use of that instead. That will work significantly better, thank you. I was mainly saving it so as to have 'proof' of seedstate advancement, and this note is precisely what I need.

          • #6
            Thank you Clyde Schechter
            Point taken. I can see why saving the state needed to regenerate a particular bootstrap sample would be better.
            I had no idea how efficiently the strL format works; I was drawing a parallel with str only. In any case, on my computer, that was the bottleneck.
            Not necessarily for the data saved on disk, but for the computer memory required, even though it was not allocated within Stata.

            Best wishes.
            Fernando
