I have an estimation procedure that requires bootstrapped standard errors. My dataset has 2.7M observations in ~3500 clusters, and the estimation procedure takes a long time per sample (the supercomputer takes roughly 15 hours to bootstrap 20 replications using Stata's bootstrap). To make better use of my time and supercomputing access, I want to essentially run the bootstrap command in pieces.
First, I want to draw 500 (or more) samples with replacement in a replicable manner (setting seed, etc.), and save each one. Then, I'll run an "array job" of my estimation procedure on each saved sample. Finally, I'll compute standard errors as Stata's bootstrap does.
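To make the last step concrete: as far as I understand, the default standard error that bootstrap reports is the sample standard deviation of the replicate estimates. Below is a minimal sketch, assuming the array job's results have been collected into one observation per replicate. The variable name b and the simulated values are placeholders for illustration only, not real results.

```stata
* Sketch only: simulated numbers stand in for the collected replicate estimates
clear
set seed 1234
set obs 500                        // one observation per bootstrap replicate
gen double b = rnormal(2, 0.5)     // hypothetical replicate estimates

* Bootstrap SE = standard deviation of the replicates (divisor B - 1)
quietly summarize b
display "bootstrap SE: " r(sd)
```

I believe bstat can also report bootstrap results from a dataset of replicates like this, but I have not checked it against my setup.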
On the 3rd sampling of my dataset, I receive this error message:
Code:
I/O error writing .dta file
    Usually such I/O errors are caused by the disk or file system being full.
r(693);
I think this happens because Stata keeps some sort of temporary file for each sample drawn, but I am not sure how to proceed. What's odd is that I've successfully used preserve, restore, and save in this looping manner on much larger datasets, but never before with the bsample command.
Though my own dataset fails on the 3rd sample, you can replicate the error on the first sample using the code below (at least on my machine). Since this is a "large" dataset problem, I did not use dataex.
Code:
clear all
sysuse auto
expand 37000
global test ""
set seed 1234 // Statalist-seed
forvalues i = 1(1)5 {
    // preserve dataset
    preserve
    // draw a random number to advance the seedstate
    gen random = runiform()
    // bootstrap sample it
    bsample, cluster(make) idcluster(make2)
    // attach seedstate
    gen seedstate = c(seed)
    // save repcount
    gen dataset_id = `i'
    // drop random number
    drop random
    // save dataset
    save "$test/sample_`i'", replace
    // save seedstate
    keep seedstate
    duplicates drop
    save "$test/seedstate_`i'", replace
    // restore dataset
    restore
}