  • Randomly Sampling Groups of Observations

    Hello everyone,

    I've run into a problem trying to randomly sample a part of my dataset (to make up a control group for econometric analysis). I'm trying to randomly sample 63 schools from, let's say a total of 500. I can easily set the seed and get that sample but the problem is that each school has 11 observations to its name, one for each year of data. Ideally I would like to keep all eleven observations for each of the 63 randomly selected schools, giving me 693 observations at the end of the day, but I haven't been able to figure out a command to do so. I've read up on sample2 and bsample but they each seem to focus on individual observations. Is there an option for either of those commands that I'm overlooking that will do the trick?

    For those who work better with visuals, here is a table with three schools, multiple observations each. Let's say I'd like to randomly sample two out of the three schools but keep the rest of their observations. How might I do this?
    Observation  School  Year
    1            A       1995
    2            A       1996
    3            A       1997
    4            A       1998
    5            B       1995
    6            B       1996
    7            B       1997
    8            B       1998
    9            C       1995
    10           C       1996
    11           C       1997
    12           C       1998

    Thanks,

    Thomas

    Stata Version: 13

  • #2
    I suggest that you try randomtag from SSC. To install, type in Stata's command window

    Code:
    ssc install randomtag
    The trick to picking a random sample of schools is to identify one observation per school that will represent the school. Then you pick a random sample of those representative observations. The following example shows how to do it with randomtag and with Stata's own sample command. Note that randomtag is significantly faster at taking the sample and does not change the data in memory.

    Code:
    clear
    set obs 500
    gen school = _n
    expand 11
    bysort school: gen year = 2000 + _n
    tempfile f
    save "`f'"
    
    * pick one observation to represent each school
    egen school1 = tag(school)
    
    * select a random sample from the tagged obs. this
    * requires randomtag (from SSC)
    set seed 2134123
    randomtag if school1, count(63) gen(t)
    
    * keep all observations from picked schools
    bysort school: egen select = total(t)
    
    sum if select
    
    * show how to use sample to generate the same
    set seed 2134123
    keep if school1
    sample 63, count
    keep school
    merge 1:m school using "`f'"
    
    sum if _merge == 3
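
    As a small worked illustration, the same tag, draw, and propagate steps can be applied to the three-school table from the original question using only built-in commands. This is a minimal sketch, not part of the code above: the variable names u and pick and the seed are assumptions made up for the illustration.

    ```stata
    clear
    input obs str1 school year
    1 A 1995
    2 A 1996
    3 A 1997
    4 A 1998
    5 B 1995
    6 B 1996
    7 B 1997
    8 B 1998
    9 C 1995
    10 C 1996
    11 C 1997
    12 C 1998
    end

    set seed 12345

    * one random draw per school, stored on its first (earliest-year) observation
    bysort school (year): gen double u = runiform() if _n == 1

    * order by the draw; missing values sort last, so the tagged
    * observations come first and the two smallest draws are picked
    sort u
    gen byte pick = _n <= 2

    * copy the pick flag from each school's first observation to all its years
    bysort school (year): replace pick = pick[1]

    keep if pick
    ```

    After the final keep, the data should hold all four years for each of the two selected schools, eight observations in total.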



    • #3
      It sounds like you wish to sample without replacement, though your mention of bsample suggests that you might also be interested in sampling with replacement? bsample does include a cluster() option that would allow you to sample schools (with their corresponding observations) with replacement.
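
      For the with-replacement case, a minimal sketch of the cluster() route might look like the following. It assumes simulated data shaped like Robert's example (500 schools, 11 years each); the variable name draw_id and the seed are made up for the illustration.

      ```stata
      clear
      set obs 500
      gen school = _n
      expand 11
      bysort school: gen year = 2000 + _n

      set seed 2134123

      * draw 63 schools with replacement, keeping every year of each drawn school;
      * idcluster() creates a new identifier so that a school drawn more than
      * once can be told apart across its repeated draws
      bsample 63, cluster(school) idcluster(draw_id)

      count
      ```

      Because each drawn school brings all 11 of its observations, the resulting dataset should contain 63 x 11 = 693 observations, with some schools possibly appearing more than once under different values of draw_id.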

      For sampling without replacement: Though randomtag looks like a useful command, here is an (untested) approach similar to Robert's that might be a little more transparent (doesn't require a user-defined command and avoids egen):


      Code:
      set seed 783489
      
      local sampsize 63
      
      // tag 1 observation from each school
      bys school: gen school_tag = _n==1
      
      // generate random number in a way that can be replicated if necessary
      bys school_tag (school year): gen rn = runiform()
      
      // randomly select the desired number of schools by sorting tagged school
      //    observations on the previously generated random number
      bys school_tag (rn): gen select = (school_tag == 1) & (_n <= `sampsize')
      
      // extend select indicator to all observations for the selected schools
      bys school (year): replace select = sum(select)
      bys school (year): replace select = select[_N]





      • #4
        I'm all for using first principles to solve problems but that's usually because you trade off a few more commands for an increase in execution speed. The raison d'être of randomtag is that it does its thing without disturbing the data in memory AND without sorting the data in memory. While in this case the number of observations is low and it doesn't matter much, I would still encourage people to use randomtag because it is significantly more efficient than the first principles method suggested by Gary.

        Here's a more sizable problem with both methods. I adjusted Gary's picking order to match the one used by Stata's sample command. All egen commands are also avoided.

        Code:
        clear
        set obs 100000
        gen id = _n
        expand 20
        bysort id: gen year = 1990 + _n
        tempfile f
        qui save "`f'"
        
        local sampsize 63
        set seed 783489
        
        timer clear
        
        * adjust first principles to choose observations in the same order
        timer on 1
        bysort id (year): gen id_tag = _n==1
        gen skip = !id_tag
        sort skip id year
        gen double rn = runiform()
        sort skip rn
        gen select = (_n <= `sampsize')
        bysort id (year): replace select = select[1]
        timer off 1
        tempfile one
        qui save "`one'"
        
        * redo using -randomtag- (from SSC)
        set seed 783489
        use "`f'", clear
        timer on 2
        bysort id (year): gen id_tag = _n==1
        randomtag if id_tag, count(63) gen(t)
        by id: gen select = t[1]
        drop t
        timer off 2
        
        timer list
        
        * show that exactly the same observations are selected
        cf _all using "`one'", all
        Without randomtag, sampling requires at least two sort commands. Since the expected efficiency of a sort is O(n log n), Gary's first principles solution becomes increasingly inefficient as the number of observations grows. Here are the timing results for the problem above:

        Code:
        . timer list
           1:      3.91 /        1 =       3.9120
           2:      0.29 /        1 =       0.2930
        If Gary's "transparent" comment relates to a lack of trust in user-written programs, that's fine but better kept to oneself unless you have a reason to doubt that the program does what it says it does.



        • #5
          Thank you both for your feedback. I ended up using the randomtag program to do the sampling for me. Robert, the code you provided above was extremely easy to adjust to suit my large dataset that I'm working on. Gary, your approach looked nice and simple as well but I ended up going with randomtag because, as Robert has shown, it's much faster in the long run and I have do files that are already taking a few hours to execute. Nevertheless, thanks for your timely response. I'll have your method written down for use in the future and I'm sure other Stata users will find this post helpful.

          Cheers,

          Thomas




          Last edited by Thomas Beatty; 28 Jul 2015, 16:28.



          • #6
            I have a question similar to this and to the query in https://www.statalist.org/forums/for...ified-variable
            I would like to draw a random sample in such a way that the sample preserves the distribution of a continuous variable in the original dataset. So far the only idea I have is to calculate percentiles, draw a random sample within each percentile, and then stitch the 100 draws together in one file. Is there a more elegant way to do this?
            Thank you very much.
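
            One possible way to avoid stitching files together, offered only as a sketch: bin the variable with xtile and let sample draw within each bin through its by() option. The simulated variable x, the bin variable pct_bin, the seed, and the 10% sampling rate are all assumptions made up for the illustration.

            ```stata
            clear
            set obs 10000
            set seed 20150728

            * stand-in for the continuous variable of interest
            gen double x = rnormal()

            * assign each observation to one of 100 percentile bins of x
            xtile pct_bin = x, nquantiles(100)

            * keep a 10% random sample within every bin, in a single step
            sample 10, by(pct_bin)

            summarize x, detail
            ```

            Since every bin is sampled at the same rate, the retained observations should follow the original distribution of x closely, up to rounding of the per-bin counts.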

