Drawing with replacement (e.g. bsample?) with resampling size greater than original _N

Tom Storwitz

Join Date: Nov 2019

Posts: 11
#1


15 Nov 2019, 10:35

Dear Statalist forum,

I have a hopefully straightforward question and would be very grateful for any help. I use Stata/SE 14.2.

I would like to draw with replacement from a subset of my data, where the subset is defined by a condition that is satisfied by, say, n_1 < _N observations. I would like to draw more than n_1 times with replacement from this subset of n_1 observations.

(Background: I have wealth data of size _N and a certain wealth value w that splits the wealth data into two parts. wealth < w is the condition that defines the above-mentioned subset. I want to create many synthetic datasets from these original wealth data, each of size _N, but for each synthetic observation I want to randomize over whether I draw from the subset (probability n_1/_N) or from some theoretical distribution (probability 1 - n_1/_N). This is why it can happen that I want to draw more than n_1 times from the subset defined by wealth < w.)

I thought the command bsample, with an if-condition defining the subset to be drawn from with replacement, would be a handy option. However, a crucial restriction of bsample is that the number of draws must not exceed the number of observations drawn from. To illustrate with a simple example, I get the following error:

Code:

. set obs 10
number of observations (_N) was 0, now 10

. gen index = _n

. bsample 8 if index<3
resample size must not be greater than number of observations
r(498);

In the above example (and against the background described above), what would be a short way to draw 8 times with replacement from the subset defined by index<3?

Looking forward to any advice,
Tom Storwitz
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

15 Nov 2019, 14:17

Here's how you can do this:

Code:

clear*
//  CREATE A DEMONSTRATION DATA SET
set obs 500
set seed 1234
gen wealth = rgamma(5, 100000)

//  DEFINE A THRESHOLDED SUBSET
//  FOR ILLUSTRATIVE PURPOSES, THE BOTTOM QUARTILE
xtile group = wealth, nq(4)
gen byte subset = (group == 1)
drop group
count if subset
local subset_size `r(N)'

//  NOW DEMONSTRATE SAMPLING WITH REPLACEMENT FROM THE SUBSET

//  BEGIN BY MAKING A SEPARATE COPY OF THE SUBSET
frame copy default subset_frame
frame change subset_frame
keep if subset
gen long link = _n
assert inrange(link, 1, `subset_size')

//  RETURN TO THE FULL DATA SET AND START TO SAMPLE
frame change default
gen long link = runiformint(1, `subset_size')
frlink m:1 link, frame(subset_frame)
frget synthetic_wealth = wealth, from(subset_frame)

//  AND IF YOU HAVE TO DO THIS REPEATEDLY, SAY 3 TIMES:
forvalues i = 1/3 {
    replace link = runiformint(1, `subset_size')
    frlink rebuild subset_frame
    frget synthetic_wealth`i' = wealth, from(subset_frame)
}

Please read the Forum FAQ for excellent advice about improving your posts and increasing your chances of timely, helpful responses. In particular:

1. When you are asking for help with code you should post example data. And always use the -dataex- command to do that. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

2. If you are not using the current version of Stata (16) you are asked to state what version you are using. The code above requires version 16 to run. If you are using an earlier version of Stata, you can still use the same logic, putting the subset data into a tempfile instead of a frame and using -merge- instead of -frlink/frget-.
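
For readers on versions before 16, the toy example from #1 (8 draws with replacement from the subset index<3) can also be handled without frames or -merge-, using explicit subscripting. What follows is only a minimal sketch of that alternative, not the tempfile route described above; the variable names subset, pick and draw and the local n1 are made up for illustration, and it assumes the number of draws does not exceed _N.

```stata
clear
set seed 1234
set obs 10
gen index = _n

// flag the subset and move its observations to the top of the data
gen byte subset = index < 3
gsort -subset index
count if subset
local n1 = r(N)

// for each of the first 8 rows, pick a random row among the first n1
// and copy that row's value via explicit subscripting
gen long pick = runiformint(1, `n1') in 1/8
gen draw = index[pick] in 1/8

// every draw must come from the subset (index 1 or 2)
assert inlist(draw, 1, 2) in 1/8
list pick draw in 1/8
```

Following this with -keep in 1/8- would leave only the 8 draws. If more draws than _N were needed, the data set would first have to be enlarged, for instance with -set obs-.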

Tom Storwitz

Join Date: Nov 2019
Posts: 11
#3

16 Nov 2019, 11:04

Dear Clyde,
Thank you very much for your helpful response!
The frames feature in Stata 16 seems great. As I wrote in my first post, I use Stata 14.2, so I implemented your logic by putting the subset data into a tempfile, as you suggested (thanks a lot for this pointer, too!).

So for others interested, this modified version using tempfile instead of frames should work for older versions:

Code:

clear*
//  CREATE A DEMONSTRATION DATA SET
set obs 500

set seed 1234

gen wealth = rgamma(5, 100000)

//  DEFINE A THRESHOLDED SUBSET
//  FOR ILLUSTRATIVE PURPOSES, THE BOTTOM QUARTILE
xtile group = wealth, nq(4)
gen byte subset = (group == 1)
drop group
count if subset
local subset_size `r(N)'


//  NOW DEMONSTRATE SAMPLING WITH REPLACEMENT FROM THE SUBSET

//  BEGIN BY MAKING A SEPARATE COPY OF THE SUBSET
preserve
keep if subset
drop subset
gen long link = _n
assert inrange(link, 1, `subset_size')
tempfile subset_tempfile
save `subset_tempfile'


//  RETURN TO THE FULL DATA SET AND START TO SAMPLE (repeatedly)
restore

rename wealth wealth_original
gen long link = .

forvalues i = 1/3 {
    replace link = runiformint(1, `subset_size') if subset ==1
    merge m:1 link using `subset_tempfile'
    rename wealth synthetic_wealth`i'
    drop if _merge==2
    drop _merge
}

Thanks again!
Best,
Tom


Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

16 Nov 2019, 11:14

Thank you for posting your tempfile-based solution!
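
As a closing note on the background described in #1, the mixing step (draw from the subset with probability n_1/_N, otherwise from a theoretical distribution) might be sketched roughly as follows. This is only an illustration under stated assumptions: the threshold is taken to be the sample median, and the "theoretical distribution" is a placeholder (the threshold plus an exponential draw), since the thread does not specify either. The names cutoff, from_subset, pick and n1 are hypothetical. The threshold is stored in a scalar called cutoff rather than w, because in an expression Stata would read w as an abbreviation of the variable wealth.

```stata
clear
set seed 1234
set obs 500
gen wealth = rgamma(5, 100000)

// hypothetical threshold (assumption): the sample median
summarize wealth, detail
scalar cutoff = r(p50)
gen byte subset = wealth < scalar(cutoff)

// move the subset to the top so that rows 1..n1 are exactly the subset
gsort -subset wealth
count if subset
local n1 = r(N)

// each synthetic observation comes from the subset with probability n1/_N
gen byte from_subset = runiform() < `n1'/_N
gen long pick = runiformint(1, `n1')
gen double synthetic_wealth = wealth[pick] if from_subset

// placeholder theoretical distribution (assumption, not from the thread):
// threshold plus an exponential draw obtained by inverse transform
replace synthetic_wealth = scalar(cutoff) - 100000*ln(runiform()) if !from_subset

// sanity checks on where each synthetic value came from
assert !missing(synthetic_wealth)
assert synthetic_wealth < scalar(cutoff) if from_subset
assert synthetic_wealth >= scalar(cutoff) if !from_subset
```

Because from_subset is random, the number of draws from the subset can exceed n_1 in any given synthetic data set, which is the situation -bsample- rejects.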