  • RE: Set Seed within Bysort

    I want to reproduce a randomization so that it comes out the same way each time. Normally I would do this with -set seed- and get the same results, but with the code below I don't always get the same randomization. Any suggestions on how to randomize within bysort and keep the same groups, or on an alternative approach? See the code below.

    *randomize to treatment clusters of 20*
    set seed 123456
    bysort cluster: gen rand = runiform()
    sort rand
    bysort cluster: gen n = _n //each cluster has 20 observations
    // Treatment variable with three levels (5 in each treatment arm and 10 in the control arm)
    generate grp = .
    replace grp = 1 if n<=5
    replace grp = 2 if n>5 & n<=10
    replace grp = 3 if n>10

  • #2
    I was not able to reproduce this inconsistency. I tried the following code and every time the results were the same:

    Code:
    clear
    set obs 20
    gen cluster = _n
    expand 20
    
    set seed 123456
    bysort cluster: gen rand = runiform()
    
    table cluster, statistic(mean rand)
    Could you perhaps provide a self-contained subset of code? The code in #1 has no data-setup step at the start, so it is hard to try to reproduce the error.


    • #3
      Your code is incomplete in the sense that you don't show how you first generate -cluster-, so the code would not run. Then again, it doesn't matter. The correct use of -set seed- in a Stata program is to issue it once, at the top of your do-file. Thereafter, any need for random numbers will generate the same stream on repeated runs of the do-file. In the snippet you show, and more generally, sorting should never rely on a random seed. If it does, you had better rethink your data, because a sort order should be fully defined by the variables that make up the sort key. There is a discussion of good and bad practices in the use of random-number seeds in the PDF documentation linked from -help set seed-.
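
A minimal sketch of that advice, assuming a toy data set like the one built in #2. The variable names id and u are hypothetical and do not come from the original post. The seed is set once at the top, and the data are put in an order fully determined by a unique key before any random numbers are drawn:

```stata
clear
set seed 123456              // set once, at the top of the do-file

* Toy data (assumption): 20 clusters of 20 observations each
set obs 20
gen cluster = _n
expand 20
gen long id = _n             // hypothetical unique observation identifier

* isid stops with an error unless (cluster, id) uniquely identifies observations
isid cluster id
sort cluster id              // unique key, so the resulting order is fully determined

* Drawn in a fixed observation order, so each observation gets the same value on every run
gen double u = runiform()
```

With real data, the key passed to -isid- would be whatever variables already identify observations uniquely.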

      If you post back with a more complete description, in English not Stata code, of what you want to do, maybe we can give you a better suggestion for code in your specific case.


      • #4
        You have 20 observations within each cluster. When you -bysort cluster:..whatever...-, the sort is indeterminate: the order of the 20 observations within each cluster is unspecified, and Stata will randomize it. That is why you are getting different results each time. What you can do instead is this:
        Code:
        set seed 123456
        gen double shuffle = runiform()
        by cluster (shuffle), sort: gen seq = _n
        gen grp = 1 if seq <= 5
        replace grp = 2 if inrange(seq, 6, 10)
        replace grp = 3 if seq > 10
        This will do a blocked randomization (block sizes of 20 defined by the cluster variable) with a 1:1:2 allocation.

        Note: This works because, assuming your data set is no more than a few million observations, the values of the random numbers generated will contain no repetitions, so the cluster and shuffle will uniquely identify observations, making the sort order deterministic. If your data set is larger than that, then you need to generate two uniform random variables, -gen double shuffle2 = runiform()- and then sort on both of them: -by cluster (shuffle shuffle2), sort: gen seq = _n- to assure unique identification and, thereby, a reproducible sort order. Note that you need to store shuffle (and shuffle2 if needed) as doubles to guarantee this: at float precision there can be repetitions.
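
The two-key variant described in the note can be sketched as follows, again on toy data like that in #2 (an assumption; the original poster's data set is not shown). The -isid- check is an addition for illustration: it stops with an error if the sort key fails to identify observations uniquely.

```stata
clear
set seed 123456

* Toy data (assumption): 20 clusters of 20 observations each
set obs 20
gen cluster = _n
expand 20

* Two independent draws, stored as doubles so that ties are vanishingly unlikely
gen double shuffle  = runiform()
gen double shuffle2 = runiform()

* Confirm that the sort key is unique, so the sort order is deterministic
isid cluster shuffle shuffle2

by cluster (shuffle shuffle2), sort: gen seq = _n
gen byte grp = cond(seq <= 5, 1, cond(seq <= 10, 2, 3))

* Each cluster should show 5, 5, and 10 observations in groups 1, 2, and 3
tabulate cluster grp
```

This sketch also assumes the observations are in the same order on every run at the point where shuffle is generated, since -runiform()- assigns values in current observation order.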


        • #5
          Perfect, thank you.
