RE: Set Seed within Bysort

John MacDonald

Join Date: Feb 2016

Posts: 44
#1

18 Apr 2023, 13:30

I want to reproduce a randomization the same way each time. Normally I would do this with -set seed- and get the same results, but with the code below I don't always get the same randomization. Any suggestions on how to randomize within -bysort- and keep the same groups on every run, or on an alternative approach? See code below.

*randomize to treatment clusters of 20*
set seed 123456
bysort cluster: gen rand = runiform()
sort rand
bysort cluster: gen n = _n //each cluster has 20 observations
// Treatment variable with three levels (5 in each treatment arm and 10 in the control arm)
generate grp = .
replace grp = 1 if n<=5
replace grp = 2 if n>5 & n<=10
replace grp = 3 if n>10
Ken Chui

Join Date: Aug 2014

Posts: 1057
#2

18 Apr 2023, 13:38

I was not able to reproduce this inconsistency. I tried the following code and every time the results were the same:

Code:

clear
set obs 20
gen cluster = _n
expand 20
set seed 123456
bysort cluster: gen rand = runiform()
table cluster, statistic(mean rand)

Could you perhaps provide a self-contained version of the code? The one in #1 does not include the lines that create the data, so it's hard to reproduce the error.
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2389
#3

18 Apr 2023, 13:43

Your code is incomplete in the sense that you don't show how you first generate -cluster-, so it would not run as posted. Then again, it doesn't matter. The correct use of -set seed- in a Stata program is to do it once, at the top of your do-file. Thereafter, any request for random numbers will draw from the same stream on repeated runs of the do-file. In the snippet you show, and more generally, sorting should never rely on a random seed. If it does, you had better rethink your data, because a sort order should be fully defined by the variables that make up the sort key. There is a discussion of good and bad practices for random-number seeds in the PDF documentation linked from -help set seed-.

If you post back with a more complete description, in English not Stata code, of what you want to do, maybe we can give you a better suggestion for code in your specific case.
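A small sketch of the point about -set seed- above: the same seed gives the same stream of random numbers. The toy data and the variable names r1 and r2 are assumptions for illustration. The seed is set a second time here only to imitate a second run of the do-file; in real work it would be set once at the top.

```stata
* Toy data, illustrative only
clear
set obs 5

* First "run": set the seed, then draw
set seed 123456
gen double r1 = runiform()

* Second "run": same seed, same draws expected
set seed 123456
gen double r2 = runiform()

* Stops with an error if the two streams differ
assert r1 == r2
```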
Clyde Schechter

Join Date: Apr 2014

Posts: 29959
#4

18 Apr 2023, 16:27

You have 20 observations within each cluster. When you -bysort cluster:..whatever...-, the sort is indeterminate: the order of the 20 observations within each cluster is unspecified, and Stata will randomize it. That is why you are getting different results each time. What you can do instead is this:

Code:

set seed 123456
gen double shuffle = runiform()
by cluster (shuffle), sort: gen seq = _n
gen grp = 1 if seq <= 5
replace grp = 2 if inrange(seq, 6, 10)
replace grp = 3 if seq > 10

This will do a blocked randomization (block sizes of 20 defined by the cluster variable) with a 1:1:2 allocation.

Note: This works because, assuming your data set is no more than a few million observations, the values of the random numbers generated will contain no repetitions, so the cluster and shuffle will uniquely identify observations, making the sort order deterministic. If your data set is larger than that, then you need to generate two uniform random variables, -gen double shuffle2 = runiform()- and then sort on both of them: -by cluster (shuffle shuffle2), sort: gen seq = _n- to assure unique identification and, thereby, a reproducible sort order. Note that you need to store shuffle (and shuffle2 if needed) as doubles to guarantee this: at float precision there can be repetitions.
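For anyone who wants to run the approach above end to end, here is a minimal self-contained sketch. The toy data (20 clusters of 20) and the identifier variable id are assumptions for illustration, not part of the original post. One extra step that may be worth considering: runiform() assigns its numbers to observations in their current order, so the sketch first puts the data in a fully determined order (sorted on a unique id) before generating shuffle.

```stata
* Toy data: 400 observations in 20 clusters of 20 (illustrative only)
clear
set obs 400
gen long id = _n
gen cluster = ceil(id/20)

* Put the data in a fully determined order before drawing random numbers
* (id is a hypothetical unique identifier; isid errors out if it is not unique)
isid id
sort id

set seed 123456
gen double shuffle = runiform()

* cluster and shuffle together should uniquely identify observations,
* which is what makes the sort below reproducible
isid cluster shuffle
by cluster (shuffle), sort: gen seq = _n

gen byte grp = 1 if seq <= 5
replace grp = 2 if inrange(seq, 6, 10)
replace grp = 3 if seq > 10

* Check the allocation: 5, 5, and 10 per cluster
tabulate cluster grp
```

If either -isid- check fails, the sort order is not fully determined and the assignment may differ between runs.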
John MacDonald

Join Date: Feb 2016

Posts: 44
#5

19 Apr 2023, 06:36

Perfect- thank you.