
  • Drawing with replacement (e.g. bsample?) with resampling size greater than original _N

    Dear statalist-forum,

    I have a hopefully straightforward question and would be very grateful for any help. I use Stata/SE 14.2.

    I would like to draw with replacement from a subset of my data, where the subset is defined by a condition that is fulfilled by, say, n_1 < _N observations. I would like to draw more than n_1 times with replacement from this subset of n_1 observations.

    (Background: I have wealth data of size _N and a certain wealth value w that splits the wealth data into two parts. The condition wealth < w defines the above-mentioned subset. I want to create many synthetic datasets from these original wealth data, each of size _N, but in each synthetic dataset I want to randomize, draw by draw, over whether I draw from the subset (probability n_1/_N) or from some theoretical distribution (probability 1 - n_1/_N). This is why I may need to draw more than n_1 times from the subset defined by wealth < w.)

    I thought the command bsample, with an if-condition defining the subset to be drawn from with replacement, would be a handy option. However, a crucial restriction of bsample is that the number of draws must not exceed the number of observations drawn from. To illustrate with a simple example, I get the following error:

    Code:
    . set obs 10
    number of observations (_N) was 0, now 10
    
    . gen index = _n
    
    . bsample 8 if index<3
    resample size must not be greater than number of observations
    r(498);
    In the above example (and against the background described above), what would be a short way to draw 8 times with replacement from the subset defined by index<3?

    Looking forward to any advice,
    Tom Storwitz

  • #2
    Here's how you can do this:
    Code:
    clear*
    
    //  CREATE A DEMONSTRATION DATA SET
    set obs 500
    
    set seed 1234
    
    gen wealth = rgamma(5, 100000)
    
    //  DEFINE A THRESHOLDED SUBSET
    //  FOR ILLUSTRATIVE PURPOSES, THE BOTTOM QUARTILE
    xtile group = wealth, nq(4)
    gen byte subset = (group == 1)
    drop group
    count if subset
    local subset_size `r(N)'
    
    //  NOW DEMONSTRATE SAMPLING WITH REPLACEMENT FROM THE SUBSET
    
    //  BEGIN BY MAKING A SEPARATE COPY OF THE SUBSET
    frame copy default subset_frame
    frame change subset_frame
    keep if subset
    gen long link = _n
    assert inrange(link, 1, `subset_size')
    
    //  RETURN TO THE FULL DATA SET AND START TO SAMPLE
    frame change default
    gen long link = runiformint(1, `subset_size')
    frlink m:1 link, frame(subset_frame)
    frget synthetic_wealth = wealth, from(subset_frame)
    
    //  AND IF YOU HAVE TO DO THIS REPEATEDLY, SAY 3 TIMES:
    forvalues i = 1/3 {
        replace link = runiformint(1, `subset_size')
        frlink rebuild subset_frame
        frget synthetic_wealth`i' = wealth, from(subset_frame)
    }
    Please read the Forum FAQ for excellent advice about improving your posts and increasing your chances of timely, helpful responses. In particular:

    1. When you are asking for help with code you should post example data. And always use the -dataex- command to do that. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    2. If you are not using the current version of Stata (16) you are asked to state what version you are using. The code above requires version 16 to run. If you are using an earlier version of Stata, you can still use the same logic, putting the subset data into a tempfile instead of a frame and using -merge- instead of -frlink/frget-.
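
    For the toy example in #1, here is a shorter frame-free sketch that should also run in Stata 14. It is a different technique from the code above: it sorts the subset to the top of the data and then uses explicit subscripting to pick rows at random. The variable names insub, pick and drawn, and the seed, are only illustrative assumptions.

    ```stata
    clear
    set obs 10
    set seed 2718
    gen index = _n

    //  FLAG THE SUBSET AND SORT IT TO THE TOP (ROWS 1 TO n1)
    gen byte insub = index < 3
    gsort -insub index
    count if insub
    local n1 = r(N)

    //  DRAW 8 TIMES WITH REPLACEMENT FROM ROWS 1 TO n1
    gen long pick = runiformint(1, `n1') in 1/8
    gen drawn = index[pick] in 1/8
    list pick drawn in 1/8
    ```

    Note that -gsort- changes the sort order of the data, so keep an identifier variable if you need to restore the original order afterwards.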


    • #3
      Dear Clyde,
      Thank you very much for your helpful response!
      The frames feature in Stata 16 seems great. As I wrote in my first post, I use Stata 14.2, so I implemented your logic by putting the subset data into a tempfile, as you suggested (thanks a lot for this pointer, too!).

      So for others interested, this modified version using tempfile instead of frames should work for older versions:

      Code:
      clear*
      //  CREATE A DEMONSTRATION DATA SET
      set obs 500
      
      set seed 1234
      
      gen wealth = rgamma(5, 100000)
      
      //  DEFINE A THRESHOLDED SUBSET
      //  FOR ILLUSTRATIVE PURPOSES, THE BOTTOM QUARTILE
      xtile group = wealth, nq(4)
      gen byte subset = (group == 1)
      drop group
      count if subset
      local subset_size `r(N)'
      
      
      //  NOW DEMONSTRATE SAMPLING WITH REPLACEMENT FROM THE SUBSET
      
      //  BEGIN BY MAKING A SEPARATE COPY OF THE SUBSET
      preserve
      keep if subset
      drop subset
      gen long link = _n
      assert inrange(link, 1, `subset_size')
      tempfile subset_tempfile
      save `subset_tempfile'
      
      
      //  RETURN TO THE FULL DATA SET AND START TO SAMPLE (repeatedly)
      restore
      
      rename wealth wealth_original
      gen long link = .
      
      forvalues i = 1/3 {
          replace link = runiformint(1, `subset_size') if subset ==1
          merge m:1 link using `subset_tempfile'
          rename wealth synthetic_wealth`i'
          drop if _merge==2
          drop _merge
      }
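
      For the use case described in #1, the mixing step (draw from the subset with probability n_1/_N, otherwise from a theoretical distribution) could be sketched roughly as follows. This is only a sketch under assumptions: the threshold w is taken to be the median, the theoretical distribution is a Pareto with scale w and shape 1.5 chosen purely as a placeholder, and the names from_subset and synthetic are made up for illustration.

      ```stata
      clear
      set obs 500
      set seed 1234
      gen wealth = rgamma(5, 100000)

      //  ASSUMED THRESHOLD w: THE MEDIAN OF WEALTH
      summarize wealth, detail
      local w = r(p50)
      gen byte subset = wealth < `w'

      //  SORT THE SUBSET TO THE TOP SO IT OCCUPIES ROWS 1 TO n1
      gsort -subset wealth
      count if subset
      local n1 = r(N)

      //  EACH SYNTHETIC VALUE COMES FROM THE SUBSET WITH PROBABILITY n1/_N
      gen byte from_subset = runiform() < `n1'/_N
      gen double synthetic = wealth[runiformint(1, `n1')] if from_subset

      //  OTHERWISE DRAW FROM THE PLACEHOLDER PARETO (INVERSE-CDF METHOD)
      replace synthetic = `w' / runiform()^(1/1.5) if !from_subset
      ```

      Wrapping the last three commands in a forvalues loop, with a suffix on the variable names, would give repeated synthetic datasets.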

      Thanks again!
      Best,
      Tom


      • #4
        Thank you for posting your tempfile-based solution!
