gen whole lot random variables (and random order) efficiently

ChungLi Wu

Join Date: May 2020

Posts: 13
#1


12 Apr 2022, 16:17

Hi all,

One of the statistical computations I've been working on involves generating 10,000 columns of random orders. I can't think of an efficient way to do this part of the task, so I'd like to ask for your help. Details and an example follow.

What I want to do is to generate 10,000 sets of random orders from 1 to 1000. It should look something like this:
order1   order2   order3   ...   order10000
1        999      567      ...   1000
2        976      432      ...   999
...      ...      ...      ...   ...
1000     2        254      ...   1

One way that I can think of to achieve what I want is:

Code:

clear all
set obs 1000
forvalues i = 1/10000 {
    gen u`i' = runiform()
    sort u`i'
    gen order`i' = _n
}

But as you can imagine, this is pretty inefficient. Is there a more efficient way to generate these data?

Thanks,
CL

Last edited by ChungLi Wu; 12 Apr 2022, 16:36.
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2371
#2

12 Apr 2022, 20:32

What really slows down your code is the repetitive use of -sort-, because the rest of the loop is pretty fast otherwise. Essentially, you have approached the problem from a "wide" mindset, but you can also structure this from a "long" mindset.

The benefit of setting up all your data in a long format is that only a few sorts are required, and then you can reshape the data into the desired wide format. The downside here is that the standard -reshape- is known to be slow. For this to work and be competitive, you must use one of the user-contributed alternatives to -reshape-. Here I demonstrate with -greshape- (from package -gtools- on SSC). The example below ran several times faster than your forvalues loop did on my machine (roughly 20 seconds versus roughly 2 minutes for the loop).

Code:

set obs 10000
gen row = _n
expand 1000
bys row : gen id = _n
gen u = runiform()
bys id (u) : gen order = _n
// at this point, if you don't need to keep -u-, simply drop it.
greshape wide u order, i(row) j(id)
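
Reading the code above, order appears to rank the 10,000 rows within each of the 1,000 id values, so the reshaped result has 1,000 order variables of length 10,000. If the layout requested in #1 is wanted (1,000 observations, with one permutation of 1 to 1,000 per column), one possible variant swaps the two sizes. The sketch below is only an illustration under stated assumptions, not the code from #2: the locals nobs and nsets and the variable col are hypothetical names, the seed is arbitrary, the number of sets is scaled down to 100 so that official -reshape- finishes quickly, and u is stored as double and dropped before reshaping.

```stata
clear
set seed 12345                 // assumption: arbitrary seed, for reproducibility
local nobs  = 1000             // length of each permutation (hypothetical name)
local nsets = 100              // number of columns; #1 asks for 10000

set obs `nobs'
generate int row = _n
expand `nsets'                 // long layout: nobs x nsets observations
bysort row: generate int col = _n
generate double u = runiform()
bysort col (u): generate int order = _n   // rank of u within each column
drop u                         // not needed once order exists
reshape wide order, i(row) j(col)
```

At full size (nsets = 10000), -greshape- as used in #2 should accept the same i() and j() options, and the variable limit (maxvar) of the Stata edition in use has to accommodate 10,001 variables.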
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#3

12 Apr 2022, 20:46

Would making u a tempvar make the code even faster?
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2371
#4

12 Apr 2022, 21:58

Originally posted by Jared Greathouse:

Would making u a tempvar make the code even faster?

I shouldn’t think so. The variable still needs to be created and held in memory. Its temporary nature requires a little overhead to keep track of naming and dropping it. If u isn’t needed afterwards, it can simply be dropped once order has been generated, so there’s room here to be more memory efficient.
Joseph Coveney

Join Date: Apr 2014

Posts: 4355
#5

12 Apr 2022, 23:23

Which of the other major statistical software providers (including noncommercial, e.g., R, Python, Julia) has not made double precision the default numerical data type?

. 
. version 17.0

. 
. clear *

. 
. local state = c(rngstate)

. 
. // Begin verbatim from #2 of the thread
. set obs 10000
Number of observations (_N) was 0, now 10,000.

. gen row = _n

. expand 1000
(9,990,000 observations created)

. bys row : gen id = _n

. gen u = runiform()

. bys id (u) : gen order = _n

. // End verbatim from #2
. 
. generate byte dup = u == u[_n-1] & id == id[_n-1]

. tabulate dup

        dup |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |  9,998,017       99.98       99.98
          1 |      1,983        0.02      100.00
------------+-----------------------------------
      Total | 10,000,000      100.00

. 
. // Suggestion
. drop _all

. 
. set rngstate `state'

. 
. quietly set obs 10000

. generate int row = _n

. 
. quietly expand 1000

. bysort row: generate int id = _n

. generate double u = runiform()

. bysort id (u) : generate int order = _n

. 
. generate byte dup = u == u[_n-1] & id == id[_n-1]

. tabulate dup

        dup |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 | 10,000,000      100.00      100.00
------------+-----------------------------------
      Total | 10,000,000      100.00

. 
. tempname file_handle

. file open `file_handle' using RNGState.txt, write text

. file write `file_handle' "`state'"

. file close `file_handle'

. 
. exit

end of do-file

.

(Because the pseudorandom-number generator's seed wasn't set in the example code, I ran the do-file from a fresh instantiation of Stata and used the startup seed. It's attached in a text file for reference.)
Attached Files

RNGState.txt (4.9 KB, 1 view)
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2371
#6

12 Apr 2022, 23:58

Originally posted by Joseph Coveney:

Which of the other major statistical software providers (including noncommercial, e.g., R, Python, Julia) has not made double precision the default numerical data type?

(Because the pseudorandom-number generator's seed wasn't set in the example code, I ran the do-file from a fresh instantiation of Stata and used the startup seed. It's attached in a text file for reference.)

I’m not sure, though double is my default type for Stata, so I wouldn’t have thought to check for duplicates. Also, setting the seed is important, though I chose not to mention it since the focus was on efficiency. With your code in #5, are you suggesting that duplicates or floating-point precision would be a problem with this random generation process?
Joseph Coveney

Join Date: Apr 2014

Posts: 4355
#7

13 Apr 2022, 01:13

I don't know what the OP is up to, and so I don't know whether the roughly 2,000 unintended ties (1,983 in the tabulation in #5) that result from the use of Stata's default single-precision floating-point storage type would be a problem in this particular case. But in general, if I want to do anything* (not just sort a column) on the basis of a series of random numbers, then I want the values to be unique. That is not attained in this random generation process with Stata's default precision.
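
The effect of the storage type on ties can be checked directly. The following is a minimal sketch, assuming an arbitrary seed and a sample of 100,000 draws; the variable names uf, ud, tie_f and tie_d are hypothetical. It stores draws from runiform() once as float and once as double, then counts the observations whose value is shared with at least one other observation.

```stata
clear
set seed 2022                          // assumption: arbitrary seed
set obs 100000                         // assumption: arbitrary sample size

generate float  uf = runiform()        // single precision, Stata's default type
generate double ud = runiform()        // double precision

// flag observations whose value occurs more than once
bysort uf: generate byte tie_f = _N > 1
bysort ud: generate byte tie_d = _N > 1

count if tie_f                         // tied observations when stored as float
count if tie_d                         // tied observations when stored as double
```

With float storage the first count should be well above zero at this sample size, whereas ties in the double column are extremely unlikely.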

Last edited by Joseph Coveney; 13 Apr 2022, 01:20. Reason: *Added in edit:anything that requires uniqueness.
1 like
Joseph Coveney

Join Date: Apr 2014

Posts: 4355
#8

13 Apr 2022, 03:39

Originally posted by Leonardo Guizzetti:

I’m not sure, though double is my default type for Stata, so I wouldn’t have thought to check for duplicates. Also, setting the seed is important, though I chose not to mention it since the focus was on efficiency.

By the way, I wasn't intending to be critical of your very helpful suggestions to the OP. And I agree that the seed is not pertinent to your suggestions. I raised it only because I wanted to run your code without modification: the seed setting (RNG state) affects just how many duplicates you'll get, and I wanted to ensure reproducibility of the tabulation that I showed.

My main point is about Stata's choice of default precision and the unintended consequences of that choice, which give rise to questions about unexpected behavior. Such questions arise on the List rather more frequently than necessary.

You've obviously given the default some consideration. Every other statistical software provider that I've checked has done the same.

It's not a talisman—I've encountered a circumstance in a particularly large dataset years ago where I had to -generate- two columns of double-precision -runiform()- and sort on them both in order to avoid duplicates—but it does seem as if other statistical software providers (even Microsoft Excel uses double precision) are onto something.
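
Sorting on two double-precision columns, as described above, might look like the following. This is a minimal sketch rather than the code used at the time: the seed, the number of observations, and the names u1, u2 and order are assumptions. The second key only matters where the first is tied, and -isid- stops with an error if any pair of values is still repeated.

```stata
clear
set seed 2022                      // assumption: arbitrary seed
set obs 1000                       // assumption: arbitrary size

generate double u1 = runiform()    // primary sort key
generate double u2 = runiform()    // secondary key, breaks ties in u1

sort u1 u2
isid u1 u2                         // errors out unless (u1, u2) is unique
generate int order = _n            // position in the random order
```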

Just adding a simple recommendation

Code:

set type double, permanently

to Stata's installation instructions would seem like it could go a long way toward avoiding precision-related unintended consequences, some of which might not be discovered by the user.
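
For anyone who prefers to try the setting before making it permanent, here is a small sketch of how the default type interacts with -generate-. The variable names x and y and the local oldtype are hypothetical. Without the permanently option the change lasts only for the current session, and an explicit storage type on -generate- works whatever the default is.

```stata
clear
local oldtype "`c(type)'"          // remember the current default type
set type double                    // session only; add ", permanently" to keep it

set obs 10
generate x = runiform()            // no type given, so the default applies
confirm double variable x          // errors unless x is stored as double

generate double y = runiform()     // explicit type, independent of the default
confirm double variable y

set type `oldtype'                 // restore the previous default
```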
1 like
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2371
#9

13 Apr 2022, 06:44

That makes sense; thanks for clarifying what you intended to demonstrate. I also agree that if uniqueness is required then it’s always a good idea to check that you have achieved that criterion.
