  • Drawing random sample from a large data set for each observation

    Hi, I have a large dataset on course enrollment. Individual students take courses in different semesters, and observations are unique at the individual-semester-coursenum level. Individuals also have different graduation years ("cohort"). For each individual, I would like to draw a random sample of individuals from their cohort, where the sample size ("total") differs across individuals. The best approach I can think of is to loop over the individual observations, use the randomtag command, and create a unique identifier for each value of the random tag (possibly the unique identifier of the student). For example, I could use the following commands:

    preserve
    keep id cohort total
    duplicates drop /* We now have one observation per individual */
    sort id
    local N=_N
    set seed 1357
    forvalues i = 1/`N' {
    local id = id[`i']
    local year = cohort[`i']
    local groupsize = total[`i']
    randomtag if cohort == `year', count(`groupsize') g(selected)
    g randomgroup = .
    replace randomgroup = `id'*selected
    }
    sort id
    save randomgroups.dta
    restore
    sort id
    merge id using randomgroups.dta

    I'm wondering if there is a faster way to do this, rather than looping over individual observations to generate random samples one at a time. Thank you for your suggestions.

  • #2
    I should mention that I realized my own error, and that randomgroup needs to be generated and set to missing before the loop.

    preserve
    keep id cohort total
    duplicates drop /* We now have one observation per individual */
    sort id
    local N=_N
    g randomgroup = .
    set seed 1357
    forvalues i = 1/`N' {
    local id = id[`i']
    local year = cohort[`i']
    local groupsize = total[`i']
    capture drop selected /* randomtag must create a new variable on each pass */
    randomtag if cohort == `year', count(`groupsize') g(selected)
    replace randomgroup = `id'*selected if selected == 1 /* leave earlier tags untouched */
    }
    drop selected
    sort id
    save randomgroups.dta, replace
    restore
    sort id
    merge m:1 id using randomgroups.dta /* many course records per id in the full data */


    • #3
      I can't deal with this in detail at the moment, but I have a few ideas that might help: The varying size of "total" aside, this seems to be a matching without replacement problem. Searching StataList will reveal a fair number of questions on this topic. /site:statalist.org match "without replacement"/ I like the user-written program -calipmatch- for its ease of use, but it implements a lot of things that are irrelevant to your problem.

      I would find the maximum value of total in your data set and match each of your individuals with that maximum number of cohort peers. Then you could easily discard the excess (max - total) cohort peers for each individual. This presumes that you have enough observations to afford some excess matches.
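
      A loop-free variant of this "draw more than needed, then discard the excess" idea can be sketched with official Stata commands only. This is a minimal sketch under assumptions, not a tested solution for the poster's data: it uses the variable names id, cohort and total from post #1, introduces the hypothetical names peer_id and u and the hypothetical output file randomgroups_long.dta, and assumes an individual should not be drawn into their own group. It pairs every individual with all cohort peers via joinby, so the intermediate data set has roughly (cohort size squared) rows per cohort and may not fit in memory for very large cohorts.

      ```stata
      preserve
      keep id cohort total
      duplicates drop                    // one observation per individual
      tempfile roster peers
      save `roster'

      * Peer pool: the same roster, with id renamed so it can be paired with id
      keep id cohort
      rename id peer_id                  // peer_id is a hypothetical name
      save `peers'

      * Form all (id, peer_id) pairs within each cohort
      use `roster', clear
      joinby cohort using `peers'
      drop if peer_id == id              // assumption: nobody is in their own group

      * Random order within each individual, then keep the first `total' peers
      set seed 1357
      generate double u = runiform()
      bysort id (u peer_id): keep if _n <= total
      drop u
      save randomgroups_long.dta, replace
      restore
      ```

      The result is in long form, one row per (id, peer_id) pair, so the same peer can appear in several individuals' groups. If a cohort has fewer than total other members, that individual simply gets all of them.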


      • #4
        Hi Mike,

        Thanks for your answer. I'm familiar with a number of matching-without-replacement commands, which is how I know the user-written randomtag command. The critical part of this problem is that the number that needs to be sampled varies by individual observation. The following code (modified from above) actually works and runs very fast, so thanks for your help. I was just looking to see whether there is a command meant to handle the problem described above, so that I don't need to define the variables needed for each observation. The following code creates the randomly drawn sample of a different size for each observation:

        preserve
        keep ucid cohort
        duplicates drop /* We now have one observation per individual */
        egen total_remaining = count(ucid), by (cohort)
        local N=_N
        set seed 1357
        g randomgroup = .
        sort ucid
        forvalues i = 1/`N' {
        global id = ucid[`i']
        global year = cohort[`i']
        global groupsize = total_remaining[`i']
        randomtag if cohort == $year, count($groupsize) g(selected$id)
        replace randomgroup = $id*selected$id if selected$id == 1
        }
        sort ucid
        save randomgroups.dta
        restore


        • #5
          For anyone else who may benefit from this post, here is the last modification I made to the above code to accomplish what I described earlier: generating a different randomly selected group for each individual from their cohort, where the size of the group and the cohort varies by individual. I removed the randomgroup variable, because the same person can be chosen for the randomly selected groups of two different individuals, in which case randomgroup would only hold the id of the last individual whose group they were selected for. Therefore, I just store each individual's group under a different variable, "selected$id". Hope this helps someone.

          And if someone has a suggestion on a command that accomplishes this for all individuals at the same time, I am open to suggestions.

          preserve
          keep ucid cohort
          duplicates drop /* We now have one observation per individual */
          egen total_remaining = count(ucid), by (cohort)
          local N=_N
          set seed 1357
          sort ucid
          forvalues i = 1/`N' {
          global id = ucid[`i']
          global year = cohort[`i']
          global groupsize = total_remaining[`i']
          randomtag if cohort == $year, count($groupsize) g(selected$id)
          }
          sort ucid
          save randomgroups.dta
          restore
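
          The wide layout saved above (one selected<id> variable per individual) can run into Stata's limit on the number of variables when there are many individuals. As a rough sketch, assuming ucid is numeric and that randomgroups.dta was written by the code above, the tags could be converted to long form with reshape. The names owner and peer_id and the file randomgroups_long.dta are hypothetical, chosen only for illustration.

          ```stata
          use randomgroups.dta, clear

          * Stack selected<id> into one variable; owner holds the numeric suffix,
          * i.e. the ucid of the individual whose group the tag belongs to
          reshape long selected, i(ucid) j(owner)

          * Keep only the pairs that were actually drawn
          keep if selected == 1
          drop selected
          rename ucid peer_id            // the person who was selected
          order owner peer_id
          sort owner peer_id
          save randomgroups_long.dta, replace
          ```

          Each remaining row says that peer_id was drawn into the random group of owner. Note that reshape briefly creates one row per (ucid, owner) combination, so this step can be slow or memory-hungry on large rosters.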
