Random sample not reproducible despite set seed

Rike Bruchmann

Join Date: May 2016
Posts: 24

29 Jun 2016, 09:34

Dear everyone,

I want to draw a random sample of 500 observations that should be reproducible when running the do-file again and again. Despite the set seed command, I always get slightly different results.

Code:

egen company_tag = tag(company_uuid) //pick one observation to represent each company
set seed 77
randomtag if company_tag, count(500) gen(t) //select a random sample from the tagged obs. this
bysort company_uuid: egen select = total(t) //keep all observations from picked companies
sum no_org if select == 1 //500 observations
keep if select == 1
drop count_investor count_investor_index
by investor_uuid, sort: gen count_investor = 1 if _n == 1
by investor_uuid, sort: gen count_investor_index = _n
sum count_investor if ba == 1 //901 BAs
sum count_investor if ba == 1 & investor_type == "individual" //810 BAs
sum count_investor if vc == 1 //173 VCs
sum no_org if tag == 2 //118 startups
sum no_org if tag == 3 //382 startups

Thanks a lot for helping me on that!

Best,
Rike

Clyde Schechter

Join Date: Apr 2014

Posts: 29959
#2

29 Jun 2016, 09:55

The problem arises because you have

Code:

egen company_tag = tag(company_uuid) //pick one observation to represent each company

before you set the seed. The -egen, tag()- command picks one observation per company_uuid. It selects that one by sorting on company_uuid and then tagging the first. But since company_uuid does not identify observations uniquely (if it did, you would have no need of company_tag), the sort order within each company_uuid is indeterminate and effectively random. So each time you run this, different observations are tagged, and everything thereafter is indeterminate.

So you need to set the seed before you use -egen, tag()-. You also should check the code leading up to what you show us as it, too, may contain indeterminate (explicit or implicit) sorts that affect later reproducibility.
Rike Bruchmann

Join Date: May 2016

Posts: 24
#3

29 Jun 2016, 10:10

Dear Clyde,

thank you for your fast response! I put the set seed command before egen, and the results still change slightly. It might be due to the indeterminate sorts you mentioned, although I don't really know how to "spot" those.

This puzzles me, especially since I have a long do-file that creates my final dataset and always ends up with the same number of observations (e.g., ventures, investors). Also, when I create count variables, I always sort the respective data first, so they should always end up in the same order, shouldn't they?

Thanks a lot for your help!

Last edited by Rike Bruchmann; 29 Jun 2016, 10:20.
Robert Picard

Join Date: Mar 2014

Posts: 1536
#4

29 Jun 2016, 10:18

I agree with Clyde, this is most likely due to the egen, tag() picking different observations at each run. However, the set seed command does not alter how data is sorted. Check the sort order prior to picking the representative observation per company. See this FAQ for an explanation of what must be going on.
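A minimal sketch of how such a check could look, using the grunfeld example data rather than the poster's dataset (the choice of dataset and key variables here is an assumption for illustration). It tests whether a variable, or a set of variables, uniquely identifies observations; when it does not, any sort on it leaves ties in an indeterminate order.

```stata
* Illustration on example data; substitute your own dataset and key variables
webuse grunfeld, clear

* company alone is not a unique key, so -isid- exits with an error;
* -capture- lets the do-file continue so the return code can be inspected
capture isid company
display "return code from isid company: " _rc   // nonzero means not unique

* count how many observations share each value of company
duplicates report company

* company and year together are a unique key, so this runs without error
isid company year
```

If -isid- fails on the variables you sort by, the order of observations within tied groups can differ from one run to the next.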

Last edited by Robert Picard; 29 Jun 2016, 10:24.
Robert Picard

Join Date: Mar 2014
Posts: 1536
#5

29 Jun 2016, 10:24

Here's a quick example of one way this could happen and how to spot it:

Code:

webuse grunfeld, clear
gen long obs = _n

bysort year: gen nobsperyear = _N
bysort company: gen nobsperco = _N
egen company_tag = tag(company)
sum obs if company_tag

A first run gives me

Code:

. sum obs if company_tag

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         obs |         10        95.9    56.75376         17        186

but the next run is:

Code:

. sum obs if company_tag

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         obs |         10        99.3    61.24641          2        182

Rike Bruchmann

Join Date: May 2016

Posts: 24
#6

29 Jun 2016, 10:35

Dear Robert,

I understand what you mean, thanks for showing this example! I guess there is no other way of randomly (but systematically) sampling the same companies (and thereby all of their observations) when using set seed, right? For example, so that the first observation per company is always picked, instead of a different one each time?
Robert Picard

Join Date: Mar 2014

Posts: 1536
#7

29 Jun 2016, 11:48

No, randomtag (from SSC) will always pick exactly the same observations if the same seed was used. If you end up with different results downstream, then the state of the data was not the same when you called randomtag. There is a nifty Stata command (isid) to sort data and verify that the data is fully sorted. Adjusting the example in #5:

Code:

webuse grunfeld, clear
gen long obs = _n

bysort year: gen nobsperyear = _N
bysort company: gen nobsperco = _N
isid company year, sort
egen company_tag = tag(company)
sum obs if company_tag

will always generate the same results. But even with fully sorted data, you may get different results if the number of observations has changed. To confirm that the data you feed randomtag changes at every run, put

Code:

save "test_order.dta"

just before the randomtag statement. On the first run of the do-file, the state of the data will be saved to "test_order.dta". On subsequent runs, execution will halt at the save statement because the file already exists. You can then compare the data in memory with the copy on disk using:

Code:

cf _all using "test_order.dta", all
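Applied to the code in #1, the same idea might look like the sketch below. It assumes that company_uuid and investor_uuid together uniquely identify each observation, which nothing in this thread confirms; substitute whatever set of variables is a unique key in your data.

```stata
* Sketch only: assumes company_uuid and investor_uuid jointly identify
* each observation (hypothetical key, not confirmed in this thread)
isid company_uuid investor_uuid, sort   // halts with an error if the key is not unique

* with the data fully sorted, the tag should be reproducible, as in the grunfeld example in #7
egen company_tag = tag(company_uuid)

set seed 77
randomtag if company_tag, count(500) gen(t)   // randomtag is from SSC
bysort company_uuid: egen select = total(t)   // keep all observations from picked companies
```

If isid stops with an error here, the chosen variables are not a unique key, and that is itself a sign that the sort order upstream is indeterminate.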