Randomly Sampling Groups of Observations

Thomas Beatty

Join Date: May 2015

Posts: 24
#1

27 Jul 2015, 16:41

Hello everyone,

I've run into a problem trying to randomly sample a part of my dataset (to make up a control group for econometric analysis). I'm trying to randomly sample 63 schools from, let's say a total of 500. I can easily set the seed and get that sample but the problem is that each school has 11 observations to its name, one for each year of data. Ideally I would like to keep all eleven observations for each of the 63 randomly selected schools, giving me 693 observations at the end of the day, but I haven't been able to figure out a command to do so. I've read up on sample2 and bsample but they each seem to focus on individual observations. Is there an option for either of those commands that I'm overlooking that will do the trick?

For those who work better with visuals, here is a table with three schools, each with multiple observations. Let's say I'd like to randomly sample two of the three schools but keep all of their observations. How might I do this?

Observation	School	Year
1	A	1995
2	A	1996
3	A	1997
4	A	1998
5	B	1995
6	B	1996
7	B	1997
8	B	1998
9	C	1995
10	C	1996
11	C	1997
12	C	1998

Thanks,

Thomas

Stata Version: 13

Robert Picard

Join Date: Mar 2014
Posts: 1536

27 Jul 2015, 17:43

I suggest that you try randomtag from SSC. To install, type in Stata's command window

Code:

ssc install randomtag

The trick to picking a random sample of schools is to identify one observation per school that will represent the school. Then you pick a random sample of those representative observations. The following example shows how to do it with randomtag and with Stata's own sample command. Note that randomtag is significantly faster at taking the sample and does not change the data in memory.

Code:

clear
set obs 500
gen school = _n
expand 11
bysort school: gen year = 2000 + _n
tempfile f
save "`f'"

* pick one observation to represent each school
egen school1 = tag(school)

* select a random sample from the tagged obs. this
* requires -randomtag- (from SSC)
set seed 2134123
randomtag if school1, count(63) gen(t)

* keep all observations from picked schools
bysort school: egen select = total(t)

sum if select

* show how to use sample to generate the same
set seed 2134123
keep if school1
sample 63, count
keep school
merge 1:m school using "`f'"

sum if _merge == 3
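
For the three-school table in the question, the same idea can be sketched with built-in commands only. This is a minimal illustration, not the method used above: the seed and the variable names u, newschool and pick are arbitrary choices made for this example.

```stata
* toy data from the question: three schools, four years each
clear
input obs str1 school year
1 A 1995
2 A 1996
3 A 1997
4 A 1998
5 B 1995
6 B 1996
7 B 1997
8 B 1998
9 C 1995
10 C 1996
11 C 1997
12 C 1998
end

* arbitrary seed, chosen only so the draw can be repeated
set seed 12345

* one random number per school, drawn on the first year's row
bysort school (year): gen double u = runiform() if _n == 1

* copy that number to every row of the same school
bysort school (year): replace u = u[1]

* order schools by their random number; rows of a school stay together
sort u school year

* flag the first row of each school, then keep the first two schools
gen byte newschool = school != school[_n-1]
gen byte pick = sum(newschool) <= 2

* pick == 1 marks two schools with all four of their yearly rows
tab school if pick
```

Because the random number is attached to the school rather than to the row, every year of a selected school is kept, which gives 8 selected observations here.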


Gary Longton

Join Date: Apr 2014

Posts: 12
#3

27 Jul 2015, 18:24

It sounds like you wish to sample without replacement, though your mention of bsample suggests that you might also be interested in sampling with replacement? bsample does include a cluster() option that would allow you to sample schools (with their corresponding observations) with replacement.
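
A minimal sketch of the with-replacement route, assuming the panel is already in memory with the variables school and year. The seed and the name newschool are arbitrary choices for illustration.

```stata
* arbitrary seed, chosen only so the draw can be repeated
set seed 2134123

* draw 63 schools with replacement; every yearly row of a drawn
* school comes along with it. this replaces the data in memory
bsample 63, cluster(school) idcluster(newschool)

* newschool numbers each draw separately, so a school that is drawn
* twice shows up as two distinct clusters
tab newschool
```

Since the same school can be drawn more than once, this only suits analyses where sampling with replacement is intended, such as bootstrapping.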

For sampling without replacement: though randomtag looks like a useful command, here is an (untested) approach similar to Robert's that might be a little more transparent (it doesn't require a user-written command and avoids egen):

Code:

set seed 783489
local sampsize 63

// tag 1 observation from each school
bys school: gen school_tag = _n==1

// generate random number in a way that can be replicated if necessary
bys school_tag (year): gen rn = runiform()

// randomly select the desired number of schools by sorting tagged school
// observations on the previously generated random number
bys school_tag (rn): gen select = (school_tag == 1) & (_n <= `sampsize')

// extend select indicator to all observations for the selected schools
bys school (year): replace select = sum(select)
bys school (year): replace select = select[_N]

Robert Picard

Join Date: Mar 2014

Posts: 1536
#4

27 Jul 2015, 21:08

I'm all for using first principles to solve problems but that's usually because you trade off a few more commands for an increase in execution speed. The raison d'être of randomtag is that it does its thing without disturbing the data in memory AND without sorting the data in memory. While in this case the number of observations is low and it doesn't matter much, I would still encourage people to use randomtag because it is significantly more efficient than the first principles method suggested by Gary.

Here's a more sizable problem with both methods. I adjusted Gary's picking order to match the one used by Stata's sample command. All egen commands are also avoided.

Code:

clear
set obs 100000
gen id = _n
expand 20
bysort id: gen year = 1990 + _n
tempfile f
qui save "`f'"

local sampsize 63
set seed 783489
timer clear

* adjust first principles to choose observations in the same order
timer on 1
bysort id (year): gen id_tag = _n==1
gen skip = !id_tag
sort skip id year
gen double rn = runiform()
sort skip rn
gen select = (_n <= `sampsize')
bysort id (year): replace select = select[1]
timer off 1
tempfile one
qui save "`one'"

* redo using -randomtag- (from SSC)
set seed 783489
use "`f'", clear
timer on 2
bysort id (year): gen id_tag = _n==1
randomtag if id_tag, count(63) gen(t)
by id: gen select = t[1]
drop t
timer off 2

timer list

* show that exactly the same observations are selected
cf _all using "`one'", all

Without randomtag, sampling requires at least two sort commands. Since the expected efficiency of a sort is O(n log n), Gary's first principles solution becomes increasingly inefficient as the number of observations grows. Here are the timing results for the problem above:

Code:

. timer list
   1:      3.91 /        1 =       3.9120
   2:      0.29 /        1 =       0.2930

If Gary's "transparent" comment relates to a lack of trust in user-written programs, that's fine but better kept to oneself unless you have a reason to doubt that the program does what it says it does.
Thomas Beatty

Join Date: May 2015

Posts: 24
#5

28 Jul 2015, 16:25

Thank you both for your feedback. I ended up using the randomtag program to do the sampling for me. Robert, the code you provided above was extremely easy to adjust to suit the large dataset I'm working on. Gary, your approach looked nice and simple as well, but I went with randomtag because, as Robert has shown, it's much faster in the long run and I have do-files that already take a few hours to execute. Nevertheless, thanks for your timely response. I'll keep your method written down for future use, and I'm sure other Stata users will find this post helpful.

Cheers,

Thomas

Last edited by Thomas Beatty; 28 Jul 2015, 16:28.
Nazzarena

Join Date: Aug 2014

Posts: 60
#6

17 Sep 2019, 11:57

I have a question similar to this one and to the query in https://www.statalist.org/forums/for...ified-variable
I would like to draw a random sample in such a way that the distribution of a continuous variable in the original dataset is preserved in the sample. So far the only idea I have is to calculate percentiles, draw a random sample within each percentile, and then stitch the 100 draws together in one file. Is there a more elegant way to do this?
Thank you very much.
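
The percentile idea described above can be sketched without stitching files together, by binning the variable and sampling within bins. This is only an illustration under assumptions: x is a hypothetical name for the continuous variable, bin is a new variable created for the example, and the 10 percent sampling rate and the seed are arbitrary.

```stata
* arbitrary seed, chosen only so the draw can be repeated
set seed 12345

* assign each observation to one of 100 percentile bins of x
* (x is a hypothetical variable name; heavy ties can yield fewer bins)
xtile bin = x, nq(100)

* keep a 10% random sample within every bin, without replacement
sample 10, by(bin)

* compare against the full data to check the distribution
summarize x, detail
```

Sampling the same share from every bin keeps the bins in roughly their original proportions, which is what preserves the shape of the distribution; very small bins will be matched less closely.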