Drawing random sample from a large data set for each observation

Guest
#1

06 Mar 2020, 10:06

Hi, I have a large dataset on course enrollment. Individual students take courses in different semesters, and observations are unique at the individual-semester-coursenum level. Individuals also have different graduation years ("cohort"). For each individual, I would like to draw a random sample of individuals from their cohort, where the sample size ("total") differs across individuals. The best approach I can think of is to loop over the individual observations, call the randomtag command for each one, and label the selected observations with a unique identifier for that draw (possibly the student's own identifier). For example, I could use the following commands:

preserve
keep id cohort total
duplicates drop /* We now have one observation per individual */
sort id
local N=_N
set seed 1357
forvalues i = 1/`N' {
local id = id[`i']
local year = cohort[`i']
local groupsize = total[`i']
randomtag if cohort == `year', count(`groupsize') g(selected)
g randomgroup = .
replace randomgroup = `id'*selected
}
sort id
save randomgroups.dta
restore
sort id
merge id using randomgroups.dta

I'm wondering if there is a faster way to do this, rather than looping over individual observations to generate random samples one at a time. Thank you for your suggestions.
Guest
#2

06 Mar 2020, 10:22

I should mention that I realized my own error: randomgroup needs to be generated and set to missing before the loop.

preserve
keep id cohort total
duplicates drop /* We now have one observation per individual */
sort id
local N=_N
g randomgroup = .
set seed 1357
forvalues i = 1/`N' {
local id = id[`i']
local year = cohort[`i']
local groupsize = total[`i']
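randomtag if cohort == `year', count(`groupsize') g(selected)
replace randomgroup = `id' if selected == 1 /* leave tags from earlier passes untouched */
drop selected /* free the name so randomtag can create it again on the next pass */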
}
sort id
save randomgroups.dta
restore
sort id
merge id using randomgroups.dta
Mike Lacy

#3

06 Mar 2020, 10:32

I can't deal with this in detail at the moment, but I have a few ideas that might help: The varying size of "total" aside, this seems to be a matching without replacement problem. Searching StataList will reveal a fair number of questions on this topic. /site:statalist.org match "without replacement"/ I like the user-written program -calipmatch- for its ease of use, but it implements a lot of things that are irrelevant to your problem.

I would find the maximum value of total in your data set and match each of your individuals with that maximum number of cohort-peers. Then you could easily discard the excess (max - total) cohort-peers for each individual. This presumes that you have enough observations to afford some excess matches.
Guest
#4

06 Mar 2020, 10:43

Hi Mike,

Thanks for your answer. I'm familiar with a number of matching without replacement commands, which is why I'm familiar with the user-generated randomtag command. The critical part of this problem is that the number that needs to be sampled varies by the individual observation. The following code actually works and runs very fast (modified from up above), so thanks for your help. Was just looking to see whether there was a command that was meant to handle the problem described above so I don't need to define the variables needed for each observation. The following code creates the randomly-drawn sample of a different size for each observation:

preserve
keep ucid cohort
duplicates drop /* We now have one observation per individual */
egen total_remaining = count(ucid), by (cohort)
local N=_N
set seed 1357
g randomgroup = .
sort ucid
forvalues i = 1/`N' {
global id = ucid[`i']
global year = cohort[`i']
global groupsize = total_remaining[`i']
randomtag if cohort == $year, count($groupsize) g(selected$id)
replace randomgroup = $id*selected$id if selected$id == 1
}
sort ucid
save randomgroups.dta
restore
Guest
#5

06 Mar 2020, 11:04

For anyone else who may benefit from this post, here is the last modification I made to the above code to accomplish what I described earlier: generating a different randomly selected group for each individual from their cohort, where the size of the group and of the cohort vary by individual. I removed the randomgroup variable, because the same person can be chosen for the randomly selected groups of two different individuals, in which case randomgroup would hold only the id of the last individual whose group they were selected for. Therefore, I just store each individual's group in a separate variable, "selected$id". Hope this helps someone.
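
To illustrate the overwriting problem described above, here is a minimal sketch on toy data. It uses only built-in commands: the line that creates selected is a hypothetical stand-in for the randomtag call, and the variable name tagged_by and the five-row data set are made up for the illustration.

```stata
clear
set obs 5
generate byte tagged_by = .              // created once, before the loop

forvalues i = 1/3 {
    * stand-in for randomtag: "group i" holds every row after row i
    generate byte selected = (_n > `i')
    replace tagged_by = `i' if selected == 1
    drop selected                        // free the name for the next pass
}

list
```

Row 4 belongs to all three groups, yet tagged_by records only the last one (3). A single tag variable cannot represent overlapping groups, which is why one variable per individual (or one row per individual-peer pair) is needed.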

And if someone has a suggestion on a command that accomplishes this for all individuals at the same time, I am open to suggestions.

preserve
keep ucid cohort
duplicates drop /* We now have one observation per individual */
egen total_remaining = count(ucid), by (cohort)
local N=_N
set seed 1357
sort ucid
forvalues i = 1/`N' {
global id = ucid[`i']
global year = cohort[`i']
global groupsize = total_remaining[`i']
randomtag if cohort == $year, count($groupsize) g(selected$id)
}
sort ucid
save randomgroups.dta
restore
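
On the open question of doing this for all individuals at once: one possible loop-free sketch, using only built-in commands rather than randomtag, is to pair every individual with every cohort peer via joinby, shuffle the peers within each individual, and keep the first total of them. The toy data, the names peer_id and u, and the rule that a student is not drawn into their own group are all assumptions made for the illustration, not part of the original problem statement.

```stata
* Toy data (hypothetical): 12 students in 2 cohorts of 6
clear
set seed 1357
set obs 12
generate long id = _n
generate int cohort = cond(_n <= 6, 2019, 2020)
generate byte total = 1 + mod(_n, 3)     // requested group sizes, 1 to 3

* Save a copy of the students to serve as candidate peers
preserve
keep id cohort
rename id peer_id
tempfile peers
save `peers'
restore

* Form every (id, peer_id) pair within a cohort
joinby cohort using `peers'
drop if peer_id == id                    // assumption: no self-selection

* Random order of peers within each id, then keep the first "total"
generate double u = runiform()
bysort id (u): keep if _n <= total
drop u
```

The result is in long form, one row per (id, peer_id) pair, so the same person can appear in several groups without anything being overwritten. The cost is memory: joinby creates one row per within-cohort pair before trimming, so with very large cohorts it may be necessary to process one cohort at a time.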