random sampling over values of a specified variable

ebony bridwell-mitchell

Join Date: Jun 2014

Posts: 19
#1

random sampling over values of a specified variable

09 Oct 2016, 15:23

Dear Statalist:

I have a dataset of 1.3 million observations of partnerships between two sets of organizations, i and j, where the variable tie==1 indicates i and j have a partnership and tie==0 indicates no partnership. For most i-j observations, tie==0 because most organizations do not partner with each other. For analytical purposes, I need to keep all the i-j observations of tie==1 and sample a subset of the i-j observations of tie==0. Because I need the subsample to be stratified by a key grouping variable, the syntax I am using is bysort group: sample n if tie != 1, count. This results in a dataset that includes all the original observations of tie==1 and n observations of tie==0 within each group. THE PROBLEM is I need a subsample of n tie==0 observations for every observation of tie==1. In other words, for an organization, i, with two partnerships I need a subsample that includes the two tie==1 observations but also includes n*2 observations of tie==0 (i.e. 14 zeros if the sampling n=7); likewise, if an organization has three partnerships, I need the subsample to include the 3 tie==1 observations and 21 tie==0 observations. Intuitively, the syntax I'd want is bysort group: sample n*realized_count if tie != 1, count, where "realized_count" is a variable I've generated for the total number of ties for each organization i. It seems that I can't use the multiplication operator (*) or a variable name (realized_count) with sample, so this syntax won't work. Alternatively, it seems like I should be able to accomplish my aim with some kind of foreach or forvalues loop, but I can't get this to work either. Any advice?

Thank you, in advance, Ebony
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#2

09 Oct 2016, 16:03

I'm not sure I understand what you're saying, but it sounds like you need to keep all of the tie==1 observations and then randomly sample a number of tie == 0 observations equal to the number of tie == 1 observations within each value of group. If I have this right, then the code below should do it. The first part of the code just creates a toy data set to develop and test the code.

Code:

// CREATE TOY DATA
clear*
set obs 5
gen group = _n
expand 50
by group, sort: gen obs_no = _n
set seed 1234
gen byte tie = runiform() < 0.2
sort group obs_no
list in 1/20, noobs clean

// START WITH ALL TIE = 1 OBSERVATIONS
tempfile building
preserve
keep if tie == 1
by group (obs_no), sort: gen pick_order = _n
rename obs_no obs_no1
drop tie
save `building'

// NOW RANDOMLY ORDER THE TIE = 0 OBSERVATIONS WITHIN GROUP
restore
keep if tie == 0
gen double shuffle1 = runiform()
gen double shuffle2 = runiform()
sort group shuffle1 shuffle2
by group: gen pick_order = _n
rename obs_no obs_no0
drop tie

// NOW MERGE THE DATA SETS
merge 1:1 group pick_order using `building', keep(match) nogenerate
drop shuffle*
reshape long obs_no, i(group pick_order) j(tie)
drop pick_order
sort group tie obs_no

Notes:
1. Since you don't need to generate toy data, you can leave out that part. BUT, there is one key piece you need. The -set seed- command is still necessary to initialize the random number generator, so that your results will be reproducible. So when you eliminate that first block of code, remember to put a -set seed- command in there somewhere before the calls to the -runiform()- function.

2. The use of two double precision random numbers to randomly sort the data is massively overkill for this small toy data set. But for a data set with 1.3 million observations, you really might need that to assure that the sort order is uniquely determined.
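
The two notes above can be sketched together as follows. This is only an illustration on made-up data: the seed value is the one from the toy example, the variable names follow the code above, and the -isid- check is an addition that is not part of the original code. It shows -set seed- placed before the first call to -runiform()-, and a check that the shuffle variables really do determine the sort order uniquely.

```stata
* toy data: 5 groups of 50 observations each
clear
set obs 5
gen group = _n
expand 50

* set the seed BEFORE any call to runiform() so the shuffle is reproducible
set seed 1234
gen double shuffle1 = runiform()
gen double shuffle2 = runiform()

* isid exits with an error if these variables do not uniquely
* identify the observations, i.e. if the sort order would be ambiguous
isid group shuffle1 shuffle2

sort group shuffle1 shuffle2
by group: gen pick_order = _n
```

If -isid- ever fails on a large data set, adding a third shuffle variable (or an existing unique identifier) to the sort is one way to break the remaining ties.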
Comment
ebony bridwell-mitchell

Join Date: Jun 2014

Posts: 19
#3

09 Oct 2016, 16:46

Thank you Clyde for your speedy reply. I haven't yet tried the code but noticed your interpretation of my problem is missing one detail, which may affect the code. You wrote, "...and then randomly sample a number of tie == 0 observations equal to the number of tie == 1 observations within each value of group." What I need to do is randomly sample a number of tie == 0 observations equal to [n] * the number of tie == 1 observations within each value of group. So if there are 3 tie==1 observations and n=7 then I need 21 tie==0 observations. Note, I will be taking different subsamples so that "n" might be 7 or 10 or 25%. Please let me know if this affects your suggested syntax. Thank you.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#4

09 Oct 2016, 17:09

Sorry, but I don't get what n is and I guess I don't understand what you're doing. I think the best thing is for you to post a small sample of your data that illustrates the possibilities, and then also show what the results from that sample should look like. Please use -dataex- to show your example data so that it can be faithfully and effortlessly reproduced in Stata. If you don't already have it, you can get it by running -ssc install dataex-. Follow the instructions in -help dataex- to use it.
Comment
ebony bridwell-mitchell

Join Date: Jun 2014

Posts: 19
#5

10 Oct 2016, 09:37

Here is an illustration of the data structure, produced with dataex but modified to highlight the structure of the observations while minimizing the number of rows. To conceptualize the full dataset, imagine that each block/group (separated by an empty row for illustrative purposes) has 1095 additional rows of id_school id_partner combinations and that there are three additional time periods (i.e. six time periods in total).

input long id_school int id_partner byte(period tie)
171445 952 1 0
171445 404 1 1
171445 1203 1 0
171445 797 1 0
171445 361 1 0

171445 952 2 0
171445 404 2 1
171445 1203 2 1
171445 797 2 0
171445 361 2 0

171445 952 3 1
171445 404 3 1
171445 1203 3 1
171445 797 3 0
171445 361 3 0

171490 952 1 1
171490 404 1 0
171490 1203 1 0
171490 797 1 0
171490 361 1 0

171490 952 2 1
171490 404 2 0
171490 1203 2 0
171490 797 2 0
171490 361 2 0

171490 952 3 1
171490 404 3 1
171490 1203 3 1
171490 797 3 0
171490 361 3 0

What I need is to randomly sample a subset of the tie==0 observations for each group, where group is defined by id_school period. So, let's say I wanted a subset of 2 (n=2) tie==0 observations. The code I would use is "bysort id_school period: sample 2 if tie != 1, count". The result for the data is below. Note that there are two tie==0 observations in every block/group.

input long id_school int id_partner byte(period tie)
171445 952 1 0
171445 404 1 1
171445 1203 1 0

171445 952 2 0
171445 404 2 1
171445 1203 2 1
171445 797 2 0

171445 952 3 1
171445 404 3 1
171445 1203 3 1
171445 797 3 0
171445 361 3 0

171490 952 1 1
171490 797 1 0
171490 361 1 0

171490 952 2 1
171490 404 2 0
171490 1203 2 0

171490 952 3 1
171490 404 3 1
171490 1203 3 1
171490 797 3 0
171490 361 3 0

BUT I am trying to accomplish something slightly different. What I need is to sample n (e.g. n=2) tie==0 observations for every tie==1 observation. An example of the desired data is below (note: for the example, I used the imaginary 1095 rows to illustrate the full dataset, which means there are more observations below than in the initial illustration above). Notice how in block/group 1 there are two tie==0 observations because there is one tie==1 observation, but in block/group 2 there are four tie==0 observations because there are two tie==1 observations. Likewise, in block/group 3 there are six tie==0 observations.

input long id_school int id_partner byte(period tie)
171445 952 1 0
171445 404 1 1
171445 1203 1 0

171445 952 2 0
171445 404 2 1
171445 1203 2 1
171445 797 2 0
171445 361 2 0
171445 123 2 0

171445 952 3 1
171445 404 3 1
171445 1203 3 1
171445 797 3 0
171445 361 3 0
171445 123 3 0
171445 223 3 0
171445 323 3 0
171445 423 3 0

171490 952 1 1
171490 404 1 0
171490 1203 1 0

171490 952 2 1
171490 404 2 0
171490 1203 2 0

171490 952 3 1
171490 404 3 1
171490 1203 3 1
171490 797 3 0
171490 361 3 0
171490 123 3 0
171490 223 3 0
171490 323 3 0
171490 423 3 0

It seems like the most straightforward way to accomplish the above is to use the conventional code (bysort id_school period: sample 2 if tie != 1, count) but then create some kind of loop so that the code would be executed for each group as many times as the total tie count for that group (i.e. once for block/group 1, twice for block/group 2 and three times for block/group 3). However, I can't figure out how to correctly construct the loop.

Thank you for your help.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35696
#6

10 Oct 2016, 10:24

Cross-posted at http://stackoverflow.com/questions/3...iable-in-stata

Please note our cross-posting policy, which is that you should tell us about it. http://www.statalist.org/forums/help#crossposting
Comment
ebony bridwell-mitchell

Join Date: Jun 2014

Posts: 19
#7

10 Oct 2016, 10:27

Thanks - cross-posted in an abbreviated version to stackoverflow.
Comment
ebony bridwell-mitchell

Join Date: Jun 2014

Posts: 19
#8

10 Oct 2016, 10:31

... http://stackoverflow.com/questions/3...iable-in-stata
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

10 Oct 2016, 10:34

Seems like this is better answered using mocked-up data. You need to first count, per id_school period group how many ties you have. With that information, all you need is to randomly order observations within groups and pick out the desired number of observations. Note that in the following, some groups have no partners and are excluded from the final sample.

Code:

* create 10 schools, each with 1095 partners over 3 periods
clear
set seed 21341234
set obs 10
gen long id_school = _n
expand 1095
gen int id_partner = _n
expand 3
bysort id_school id_partner: gen byte period = _n

* create a small number of partner ties per id_school period group
isid id_school period id_partner, sort
by id_school period: gen byte tie = runiformint(1,1095) < 3

* count the number of partners per id_school period group
by id_school period: egen npartners = total(tie)
by id_school period: gen byte sp_tag = _n == 1
tab npartners if sp_tag

* by id_school period, sample all tie==1 and 2*npartners with tie==0
gen double mixitup = runiform()
isid id_school period tie mixitup id_partner, sort
by id_school period tie: gen byte pick = tie == 1 | _n <= (2 * npartners)

keep if pick
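
For other sampling ratios (post #3 mentions 7 or 10 zeros per tie), the same approach can be sketched with the multiplier held in a local macro. This is a minimal sketch and not a tested solution for the real data: the local macro name n, its value of 7, and the variable nzeros are hypothetical choices made for illustration, and the closing -assert- assumes every group has at least n * npartners zeros available. The percentage case (25%) is not covered here.

```stata
* toy data with the same shape as above: 10 schools, 1095 partners, 3 periods
clear
set seed 21341234
set obs 10
gen long id_school = _n
expand 1095
gen int id_partner = _n
expand 3
bysort id_school id_partner: gen byte period = _n
gen byte tie = runiformint(1,1095) < 3

* hypothetical multiplier: number of tie==0 observations to keep per tie==1
local n 7

* count ties per group, shuffle within group and tie status, then pick
bysort id_school period: egen npartners = total(tie)
gen double mixitup = runiform()
isid id_school period tie mixitup id_partner, sort
by id_school period tie: gen byte pick = tie == 1 | _n <= (`n' * npartners)
keep if pick

* check: each remaining group holds exactly n zeros per tie
by id_school period: egen nzeros = total(tie == 0)
assert nzeros == `n' * npartners
```

Changing the value stored in the local macro and rerunning from the unsampled data gives the other subsamples. As in the code above, groups with no ties drop out entirely.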

Comment

ebony bridwell-mitchell

Join Date: Jun 2014

Posts: 19
#10

10 Oct 2016, 11:48

Eureka! I had gotten as far as counting the total number of ties per group (see my attempt at bysort group: sample n*realized_count if tie != 1, count in post #1) but couldn't figure out the rest! Thank you Robert and Statalist!

Last edited by ebony bridwell-mitchell; 10 Oct 2016, 11:59.
Comment
