  • Creating multiple bootstrapped samples

    I'd like to create, say, B=100 bootstrapped samples of a dataset. I was a little surprised to learn that the -bsample- command won't do that. -bsample- will only create one bootstrapped sample, with a sample size no bigger than _N. And who would want just one bootstrapped sample?

    I can create 100 bootstrapped samples by putting -bsample- inside a loop, but it doesn't seem like users should have to do that. Is there a command that can create 100 bootstrapped samples with a single line of code?

    (I do know that the -bootstrap-: prefix will generate the samples and analyze them automatically, but I have a slightly different use for the bootstrap, so that's not going to solve my problem.)

  • #2
    Bootstrap samples are most commonly generated either by -bootstrap- or by -simulate-. Both work the same way: you write a program that does one run of the bootstrap, and then you use -bootstrap- or -simulate- to call the program repeatedly.

    See the manual entries for both commands.
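
    A minimal sketch of that pattern with -simulate-, under stated assumptions: the auto dataset, the mean of price as the statistic, the seed, and the program name onerep are all chosen only for illustration and are not from this thread.
    Code:
    capture program drop onerep
    program define onerep, rclass
        // one run of the bootstrap: load the data, resample, compute one statistic
        sysuse auto, clear
        bsample
        summarize price, meanonly
        return scalar mean_price = r(mean)
    end
    set seed 12345
    // call onerep 100 times; the data in memory are replaced by the 100 results
    simulate mean_price = r(mean_price), reps(100) nodots: onerep
    summarize mean_price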


    • #3
      Originally posted by paulvonhippel
      And who would want just one bootstrapped sample?
      I dare say you want one sample (at a time) in the vast majority of situations. I do not know what you are up to, but usually, you would loop over the created samples in one way or another. I am having a hard time imagining why, instead of drawing the samples inside a loop one at a time, you would want to draw all samples first, then loop over the resulting samples. Also, where would you want to store those samples? In separate datasets? In frames (possible only since Stata 16)? As variables, indicating the frequency weight with which each observation is drawn? As additional observations, added to the current dataset?


      • #4
        daniel klein : I've been stacking up all B=100 (say) bootstrapped samples in a single dataset, as in the code below. It's convenient, though it does get a little large. Is this unusual? Is it more common to generate the bootstrapped samples one at a time, and discard each one after it's been analyzed? I note that stacking up all the simulated samples is common in multiple imputation, but maybe the conventions are different in the bootstrap community.
        Code:
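        local B 1000
        // Draw B bootstrap samples and stack them in boot_chile.dta
        forvalues b = 1/`B' {
         use chile_1_school, clear
         bsample               // resample _N observations with replacement
         gen b=`b'             // tag each observation with its sample number
         if `b'==1 {
          save boot_chile, replace
         }
         else {
          // add the samples stacked so far, then save the enlarged file
          qui append using boot_chile
          qui save boot_chile, replace
         }
        }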


        • #5
          You mysteriously stated that you "have a slightly different use for the bootstrap, so" the way people normally do the bootstrap is "not going to solve your problem."

          Then you should carry on with whatever is solving your problem.

          But people do not do what you do, because the purpose of the bootstrap is to create a bootstrap sample, calculate a replica of the statistics of interest from this bootstrap sample, and save the statistics.

          Then the process repeats B times so that you end up with B replicas of your collection of statistics.

          The bootstrap samples are of no interest in themselves, except as the source from which one derives the replica statistics. Therefore nobody saves the bootstrap samples the way you have done.

          But there is nothing wrong with what you have done. Now you can calculate whatever you need to calculate by b, and you have your collection of bootstrap replicas.


          • #6
            Thanks for getting back.

            Honestly, I have no idea what is common and what is not. That probably depends on the software that you are using, on personal preference, and on the specific application.

            Concerning software, Joro has already pointed out that in Stata, you would typically write a program that does whatever you want to do, then feed this program to bootstrap. This essentially boils down to drawing one sample at a time and discarding the sample after the program has finished. From this perspective, it makes sense that bsample draws one sample at a time; its approach mirrors that of bootstrap.

            Personally, I would also tend to set up a single loop to draw the sample and do whatever I would like to do with that sample before moving on to the next iteration. Perhaps I would even write a program to draw the sample, etc., then call this program within a loop. I imagine that the resulting code would be both easy to write and easy to follow.
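
            The single-loop approach described above might look roughly like the following. This is only a sketch: the auto dataset, the statistic (mean of price), the seed, and the names results and bootstats are assumptions made for illustration.
            Code:
            sysuse auto, clear
            set seed 12345
            tempname results
            tempfile bootstats
            // one row of results per bootstrap sample
            postfile `results' int b double mean_price using `bootstats'
            forvalues b = 1/100 {
                preserve
                bsample                          // draw one sample
                quietly summarize price          // analyze it
                post `results' (`b') (r(mean))   // keep only the statistic
                restore                          // discard the sample
            }
            postclose `results'
            use `bootstats', clear
            summarize mean_price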

            Concerning the specific application, I still have no idea where you are going. You seem to have something in mind that would not be considered "common". Stata obviously has various ways in which you can store the samples first. Perhaps this approach makes the most sense for your application. Perhaps you just like to think about your problem in this way. That is perfectly fine.


            • #7
              I agree with Joro and Daniel that it is not clear why you would want to save the data (as opposed to the statistic(s)) from each sample, but saving just the data can be done with the -bootstrap- command and a minor piece of trickery, namely that the program called by -bootstrap- doesn't calculate anything, but just saves the sample data.
              Code:
              cap prog drop bootsave
              prog bootsave, rclass
                 syntax,  outfile(string)
                 replace repid = $repcount
                 // "something" just exists to satisfy -bootstrap-
                 return scalar something = -1
                 append using `outfile'
                 save `outfile', replace
                 global repcount = $repcount + 1
              end
              //
              // Set up a file to hold the data from each bootstrapped sample
              clear
              save c:/temp/mybootdata.dta, emptyok replace
              //
              // Example using auto.dta
              sysuse auto, clear
              // I avoid globals like the plague--perhaps an odd turn of phrase these days--
              // but I can't think of how else to count/id each sample rep.
              global repcount = 0
              gen long repid = .  // will identify each rep's sample
              bootstrap r(something), noisily reps(5):  bootsave, outfile("c:/temp/mybootdata.dta")
              // Get the data for all the bootstrap samples
              clear
              use c:/temp/mybootdata.dta
              tab repid
              drop if repid == 0 // id for original sample, not wanted
              Note that one could just as well--actually better--work with a data file containing only an observation id variable for the original sample, save bootstrap samples of only that id, and then -merge m:1- the original data onto the collected samples of bootstrap ids when done.
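
              A rough sketch of that id-only variant follows. For brevity it draws the samples with -bsample- in a plain loop rather than through the -bootstrap- trick above; the variable name obsid, the seed, and the tempfile names are assumptions made for illustration.
              Code:
              sysuse auto, clear
              set seed 12345
              gen long obsid = _n              // observation id for the original sample
              tempfile original bootids
              save `original'
              forvalues b = 1/5 {
                  use obsid using `original', clear   // load only the id variable
                  bsample
                  gen long repid = `b'
                  if `b' > 1 {
                      append using `bootids'          // add the ids collected so far
                  }
                  save `bootids', replace
              }
              // attach the original data to every sampled id
              merge m:1 obsid using `original', keep(match) nogenerate
              tab repid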


              • #8
                Thanks all. I'm combining the bootstrap with multiple imputation, in the manner described by von Hippel & Bartlett (Statistical Science, 2020).

                There's a bit of a culture clash here, as multiple imputation keeps all the imputed datasets, but bootstrap software tends to generate and discard them one at a time. That might be because multiple imputation is more computationally intensive than the bootstrap, so generating and regenerating imputed datasets as needed may seem wasteful.


                • #9
                  thanks for the draft (I have not read it yet); in the past, when I wanted to do this I used either Method 1 or Method 2 from
                  Schomaker, M and Heumann, C (2018), "Bootstrap inference when using multiple imputation", _Statistics in Medicine_, 37: 2252-2266 which I don't see in the reference list of your draft


                  • #10
                    in addition to the article above, you might want to look at https://stats.idre.ucla.edu/stata/fa...-imputed-data/


                    • #11
                      Thanks, Rich Goldstein. I shared the arXiv version of von Hippel & Bartlett (2020), but I should have clarified that the article has been published by Statistical Science. Here's a citation with a link that doesn't seem to be paywalled:

                      You're correct that other recipes have been proposed for bootstrapped multiple imputation, but many are computationally intensive and don't guarantee consistent standard errors or nominal coverage rates, particularly when the imputation and analysis models are uncongenial or misspecified. Bartlett & Hughes (2020) evaluated different proposals through simulation and concluded that "only the Boot MI percentile (with moderate M) and von Hippel approaches give intervals with nominal coverage.... An advantage of the von Hippel approach is that it is far less computationally costly." Here's a reference with a non-paywalled link:

                      Jonathan has also posted a YouTube presentation summarizing these results on his blog.

                      To prevent confusion, I should say that Jonathan still calls it the "von Hippel approach," even though the article that derives it is by von Hippel & Bartlett. That's probably because I wrote the early drafts of the article myself, and the bootstrap MI method was already there when he found it, before he joined me as a coauthor.

                      I'd be interested in understanding your interest in bootstrapped multiple imputation and the situations where you find it useful. You can test the command I'm writing if you like.


                      • #12
                        paulvonhippel thanks; I am currently away for a long weekend <grin> but will look at these cites next week; in general, my interest in this issue has to do with internal validation of an MI model


                        • #13
                          Rich Goldstein : Got it. You can reach me any time through the contact form at paulvonhippel.com
