Creating multiple bootstrapped samples

paulvonhippel

Join Date: Apr 2014

Posts: 502
#1

Creating multiple bootstrapped samples

17 Oct 2020, 14:27

I'd like to create, say, B=100 bootstrapped samples of a dataset. I was a little surprised to learn that the -bsample- command won't do that. -bsample- will only create one bootstrapped sample, with a sample size no bigger than _N. And who would want just one bootstrapped sample?

I can create 100 bootstrapped samples by putting -bsample- inside a loop, but it doesn't seem like users should have to do that. Is there a command that can create 100 bootstrapped samples with a single line of code?

(I do know that the -bootstrap-: prefix will generate the samples and analyze them automatically, but I have a slightly different use for the bootstrap, so that's not going to solve my problem.)
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#2

17 Oct 2020, 15:31

Bootstrap samples are most commonly generated either by -bootstrap- or by -simulate-. Both work the same way: you write a program that does one run of the bootstrap, and then you use -bootstrap- or -simulate- to call the program repeatedly.

See the manual of the two mentioned commands.
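
As a minimal sketch of that pattern, assuming the auto dataset that ships with Stata and a hypothetical program name onerep (neither comes from this thread), a program that computes one statistic can be handed to -bootstrap- like this:

```stata
* Hypothetical example: onerep and mean_mpg are illustrative names
capture program drop onerep
program define onerep, rclass
    // one run: compute the statistic on whatever sample is in memory
    summarize mpg, meanonly
    return scalar mean_mpg = r(mean)
end

sysuse auto, clear
// -bootstrap- resamples the data and calls onerep once per replicate
bootstrap mean_mpg = r(mean_mpg), reps(100) seed(12345): onerep
```

With -simulate-, the program would instead have to load the data and call -bsample- itself before computing the statistic.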
daniel klein

Join Date: Mar 2014

Posts: 3850
#3

18 Oct 2020, 02:24

Originally posted by paulvonhippel

And who would want just one bootstrapped sample?

I dare say you want one sample (at a time) in the vast majority of situations. I do not know what you are up to, but usually, you would loop over the created samples in one way or the other. I am having a hard time imagining why, instead of drawing the samples inside a loop one at a time, you would want to draw all samples first, then loop over the resulting samples. Also, where would you want to store those samples? In separate datasets? In frames (possible only since Stata 16)? As variables, indicating the frequency weight with which each observation is drawn? As additional observations, added to the current dataset?
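
The frequency-weight option can be sketched with the weight() option of -bsample-, which leaves the data in memory unchanged and records how many times each observation was drawn in an existing variable. This is only a sketch, assuming the auto dataset and hypothetical variable names fw1 to fw100:

```stata
sysuse auto, clear
set seed 12345
forvalues b = 1/100 {
    // weight() needs an existing variable, which it overwrites
    generate long fw`b' = .
    bsample, weight(fw`b')
}
// analyze replicate 1 by treating its draw counts as frequency weights
summarize mpg [fweight = fw1]
```

Each fw variable then describes one bootstrap sample, so all B samples sit in a dataset with the original number of observations.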
paulvonhippel

Join Date: Apr 2014

Posts: 502
#4

18 Oct 2020, 12:07

daniel klein : I've been stacking up all B=100 (say) bootstrapped samples in a single dataset, as in the code below. It's convenient, though it does get a little large. Is this unusual? Is it more common to generate the bootstrapped samples one at a time, and discard each one after it's been analyzed? I note that stacking up all the simulated samples is common in multiple imputation, but maybe the conventions are different in the bootstrap community.

Code:

local B 1000
forvalues b = 1/`B' {
    use chile_1_school, clear
    bsample
    gen b = `b'
    if `b' == 1 {
        save boot_chile, replace
    }
    else {
        qui append using boot_chile
        qui save boot_chile, replace
    }
}
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#5

18 Oct 2020, 12:49

You mysteriously stated that you "have a slightly different use for the bootstrap, so" the way people normally do the bootstrap is "not going to solve your problem."

Then you should carry on with whatever is solving your problem.

But people do not do what you do, because the purpose of the bootstrap is to create a bootstrap sample, calculate a replicate of the statistics of interest from that sample, and save the statistics.

Then the process repeats B times, so that you end up with B replicates of your collection of statistics.

The bootstrap samples are of no interest in themselves, except as the source of the replicate statistics. Therefore nobody saves the bootstrap samples like you have done.

But there is nothing wrong with what you have done. You can now calculate whatever you need by b, and you will have your collection of bootstrap replicates.
daniel klein

Join Date: Mar 2014

Posts: 3850
#6

18 Oct 2020, 13:54

Thanks for getting back.

Honestly, I have no idea what is common and what is not. That probably depends on the software that you are using, on personal preference, and on the specific application.

Concerning software, Joro has already pointed out that in Stata, you would typically write a program that does whatever you want to do, then feed this program to bootstrap. This essentially boils down to drawing one sample at a time and discarding the sample after the program has finished. From this perspective, it makes sense that bsample draws one sample at a time; its approach mirrors that of bootstrap.

Personally, I would also tend to set up a single loop to draw the sample and do whatever I would like to do with that sample before moving on to the next iteration. Perhaps I would even write a program to draw the sample, etc., then call this program within a loop. I imagine that the resulting code would be both easy to write and easy to follow.
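
A single loop of that kind might look like the sketch below. It assumes the auto dataset and the mean of mpg as a stand-in statistic; the names results, stats, and mean_mpg are hypothetical and not from this thread. It is meant to be run as one do-file, since it relies on local macros.

```stata
sysuse auto, clear
set seed 12345
tempname results
tempfile stats
// open a results file with one row per bootstrap replicate
postfile `results' int b double mean_mpg using `stats'
forvalues b = 1/100 {
    preserve
    bsample                          // draw one sample, replacing the data
    summarize mpg, meanonly
    post `results' (`b') (r(mean))   // keep only the statistic
    restore                          // discard the sample
}
postclose `results'
use `stats', clear
summarize mean_mpg
```

Only the replicate statistics survive the loop, which is why the samples themselves never need to be stored.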

Concerning the specific application, I still have no idea where you are going. You seem to have something in mind that would not be considered "common". Stata obviously has various ways in which you can store the samples first. Perhaps this approach makes the most sense for your application. Perhaps you just like to think about your problem in this way. That is perfectly fine.

Mike Lacy

Join Date: Apr 2014
Posts: 2416
#7

18 Oct 2020, 21:55

I'd agree with Joro and Daniel as to not being clear why you would want to save the data (as opposed to the statistic(s)) from each sample, but saving just the data can be done with the -bootstrap- command and a minor piece of trickery, namely that the program called by -bootstrap- doesn't calculate anything, but just saves the sample data.

Code:

cap prog drop bootsave
prog bootsave, rclass
   syntax,  outfile(string)
   replace repid = $repcount
   // "something" just exists to satisfy -bootstrap-
   return scalar something = -1
   append using `outfile'
   save `outfile', replace
   global repcount = $repcount + 1
end
//
// Set up a file to hold the data from each bootstrapped sample
clear
save c:/temp/mybootdata.dta, emptyok replace
//
// Example using auto.dta
sysuse auto, clear
// I avoid globals like the plague--perhaps an odd turn of phrase these days--
// but I can't think of how else to count/id each sample rep.
global repcount = 0
gen long repid = .  // will identify each rep's sample
bootstrap r(something), noisily reps(5):  bootsave, outfile("c:/temp/mybootdata.dta")
// Get the data for all the bootstrap samples
clear
use c:/temp/mybootdata.dta
tab repid
drop if repid == 0 // id for original sample, not wanted

Note that one could just as well--actually better--work with a data file containing only an observation id variable for the original sample, save bootstrap samples of only that id, and then -merge m:1- the original data onto the collected samples of bootstrap ids when done.
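
A rough sketch of that id-only variant follows. For brevity it draws the samples with -bsample- in a plain loop rather than through the -bootstrap- trick above, and the names obsid, full, idonly, and bootids are hypothetical. It should be run as one do-file so that the temporary files persist.

```stata
sysuse auto, clear
generate long obsid = _n          // observation id for the original sample
tempfile full idonly bootids
save `full'
keep obsid
save `idonly'

set seed 12345
forvalues b = 1/5 {
    use `idonly', clear
    bsample                       // resample the ids only
    generate long repid = `b'
    if `b' > 1 {
        append using `bootids'    // stack onto the earlier replicates
    }
    save `bootids', replace
}
// attach the original variables to every sampled id;
// keep(match) drops original observations that were never drawn
merge m:1 obsid using `full', keep(match) nogenerate
sort repid obsid
tab repid
```

The stacked file of ids stays small while the samples are being drawn, and the full set of variables is attached only once at the end.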


paulvonhippel

Join Date: Apr 2014

Posts: 502
#8

19 Oct 2020, 11:56

Thanks all. I'm combining the bootstrap with multiple imputation, in the manner described by von Hippel & Bartlett (Statistical Science, 2020).

There's a bit of a culture clash here, as multiple imputation keeps all the imputed datasets, but bootstrap software tends to generate and discard them one at a time. That might be because multiple imputation is more computationally intensive than the bootstrap, so generating and regenerating imputed datasets as needed may seem wasteful.
Rich Goldstein

Join Date: Mar 2014

Posts: 4464
#9

19 Oct 2020, 13:19

thanks for the draft (I have not read it yet); in the past, when I wanted to do this I used either Method 1 or Method 2 from
Schomaker, M and Heumann, C (2018), "Bootstrap inference when using multiple imputation", _Statistics in Medicine_, 37: 2252-2266 which I don't see in the reference list of your draft
Rich Goldstein

Join Date: Mar 2014

Posts: 4464
#10

20 Oct 2020, 12:28

in addition to the article above, you might want to look at https://stats.idre.ucla.edu/stata/fa...-imputed-data/
paulvonhippel

Join Date: Apr 2014

Posts: 502
#11

22 Oct 2020, 08:50

Thanks, Rich Goldstein. I shared the arXiv version of von Hippel & Bartlett (2020), but I should have clarified that the article has been published by Statistical Science. Here's a citation with a link that doesn't seem to be paywalled:
von Hippel, P.T. & Bartlett, J. (2020). “Maximum likelihood multiple imputation: Faster imputation and consistent standard errors without posterior draws.” Published online ahead of print, Statistical Science. Also available as arXiv e-print 1210.0870.

You're correct that other recipes have been proposed for bootstrapped multiple imputation, but many are computationally intensive and don't guarantee consistent standard errors or nominal coverage rates, particularly when the imputation and analysis models are uncongenial or misspecified. Bartlett & Hughes (2020) evaluated different proposals through simulation and concluded that "only the Boot MI percentile (with moderate M) and von Hippel approaches give intervals with nominal coverage.... An advantage of the von Hippel approach is that it is far less computationally costly." Here's a reference with a non-paywalled link:
Bartlett, J., & Hughes, R. (2020). "Bootstrap Inference for Multiple Imputation under Uncongeniality and Misspecification." Statistical Methods in Medical Research. Volume 29, issue 12, pages 3533-3546.

Jonathan has also posted a YouTube presentation summarizing these results on his blog.

To prevent confusion, I should say that Jonathan still calls it the "von Hippel approach," even though the article that derives it is by von Hippel & Bartlett. That's probably because I wrote the early drafts of the article myself, and the bootstrap MI method was already there when he found it, before he joined me as a coauthor.

I'd be interested in understanding your interest in bootstrapped multiple imputation and the situations where you find it useful. You can test the command I'm writing if you like.
Rich Goldstein

Join Date: Mar 2014

Posts: 4464
#12

22 Oct 2020, 11:02

paulvonhippel thanks; I am currently away for a long weekend <grin> but will look at these cites next week; in general, my interest in this issue has to do with internal validation of an MI model
paulvonhippel

Join Date: Apr 2014

Posts: 502
#13

23 Oct 2020, 13:40

Rich Goldstein : Got it. You can reach me any time through the contact form at paulvonhippel.com