Drawing multiple random samples that are larger than underlying dataset

Henrik Zaunbrecher

Join Date: Jul 2015

Posts: 3
#1

07 Dec 2022, 16:49

Hi,

I am currently working on a project where I want to do the following:
I have a dataset with a large set of characteristics (e.g. age) of people in year t. I also have a file with a population forecast of how many people of age X there will be in year t+30. I now want to draw from my dataset to build a population that fits the forecasted demography, to see how the changes in the age composition of the population affect some other variables (assuming that everything other than the age composition stays the same). That also implies oversampling some age groups (there are fewer old people now than there will be in the future).

One way of doing this is to expand the original dataset so that the restriction that the bsample command does not allow samples larger than the dataset it draws from is no longer a problem. I would then use a loop that draws a sample per age category and saves each one as a separate dataset (I might use frames instead, but frames are limited to 100 and working memory will likely become an issue), and append the files at the end to get the final dataset.
However, this feels more like a workaround than an efficient way of doing this, so I wonder if there is a smarter/faster way?

Bonus question: It seems that using the values from the population forecast, which are stored in a separate frame, as input for the 'bsample' command works just fine. However, doing the same with the 'sample' command does not, and I get '_frval found where number expected'. Is there any way to make this work?

Best
Henrik
Eric Makela

Join Date: Aug 2022

Posts: 45
#2

07 Dec 2022, 19:30

Hi Henrik, would utilizing the group-wise population growth rates to create analytical weights for your observations work for your research? This seems like the way to go about your problem, since you can then run your empirical models without loss of statistical validity.

For example, each observation in year t has a weight of 1. Analytical weights for year t+30 are [demographic group's % of total population in t+30] / [demographic group's % of total population in t]. Or am I missing some more complex factor here?
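To make the ratio above concrete, here is a minimal sketch in Python (not Stata; the thread itself contains no code). The function name `reweighting_factors`, the group labels, and the toy numbers are all hypothetical and chosen only for illustration.

```python
from collections import Counter

def reweighting_factors(current_groups, forecast_counts):
    """Return {group: share in t+30 / share in t} for each group in the data."""
    current_counts = Counter(current_groups)
    n_current = sum(current_counts.values())
    n_future = sum(forecast_counts.values())
    factors = {}
    for group, count in current_counts.items():
        share_now = count / n_current
        share_future = forecast_counts.get(group, 0) / n_future
        factors[group] = share_future / share_now
    return factors

# Hypothetical toy data: 6 "young" and 4 "old" observations in year t,
# and a forecast in which both groups are equally large in year t+30.
sample_groups = ["young"] * 6 + ["old"] * 4
forecast = {"young": 500, "old": 500}
weights = reweighting_factors(sample_groups, forecast)
```

With these toy numbers the "old" group is weighted up (0.5 / 0.4) and the "young" group is weighted down (0.5 / 0.6), and the weights still sum to the original number of observations.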
Henrik Zaunbrecher

Join Date: Jul 2015

Posts: 3
#3

12 Dec 2022, 15:49

That would work for some of the analysis, but not all of the packages we want to use allow aweights. Ideally we would just want to create a new dataset for t+30 by sampling with replacement from the current dataset (while adhering to the demographic forecast).
Mike Lacy

Join Date: Apr 2014

Posts: 2404
#4

12 Dec 2022, 19:01

Henrik -- I'd suggest taking a look at the community-contributed command -gsample- (see -ssc describe gsample-). It will happily take samples larger than the original number of observations, work with or without replacement, and allow stratification and clustering, among other features. It will create a new data set, or generate sample frequencies in a variable, from which you could create a sampled data set using -expand-. Let us know if this works for you.
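The "sample frequencies, then expand" idea can be sketched independently of any particular command. The following Python sketch is not -gsample-'s implementation and does not use its syntax. It only illustrates stratified sampling with replacement to forecast target counts, where a target may exceed the number of observations available in a stratum. All names (`sample_frequencies`, `expand_rows`, the toy rows and targets) are hypothetical.

```python
import random

def sample_frequencies(groups, targets, seed=12345):
    """For each observation, count how often it is drawn when every group
    is sampled with replacement up to its target size."""
    rng = random.Random(seed)
    freq = [0] * len(groups)
    index_by_group = {}
    for i, g in enumerate(groups):
        index_by_group.setdefault(g, []).append(i)
    for g, n_target in targets.items():
        members = index_by_group.get(g)
        if not members:
            raise ValueError(f"no observations in group {g!r}")
        # Sampling with replacement, so n_target may exceed len(members).
        for i in rng.choices(members, k=n_target):
            freq[i] += 1
    return freq

def expand_rows(rows, freq):
    """Repeat each row freq times; rows with frequency 0 are dropped."""
    return [row for row, f in zip(rows, freq) for _ in range(f)]

# Hypothetical toy data: 3 young and 2 old people today;
# the forecast asks for 2 young and 5 old people.
rows = [
    {"id": 1, "age_group": "young"},
    {"id": 2, "age_group": "young"},
    {"id": 3, "age_group": "young"},
    {"id": 4, "age_group": "old"},
    {"id": 5, "age_group": "old"},
]
targets = {"young": 2, "old": 5}
freq = sample_frequencies([r["age_group"] for r in rows], targets)
future = expand_rows(rows, freq)
```

Keeping the frequencies as a separate step means the original data stay untouched until the final expansion, which avoids saving and appending one file per age group.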
Henrik Zaunbrecher

Join Date: Jul 2015

Posts: 3
#5

15 Dec 2022, 18:07

That's exactly what I was looking for! Thanks a ton!