  • gen whole lot random variables (and random order) efficiently

    Hi all,

    One of the statistical computations that I've been working on involves assigning 10,000 columns of random orders. I can't think of an efficient way to do this part of the task, so I'd like to ask for your help. Details and an example are below:

    What I want to do is to generate 10,000 sets of random orders from 1 to 1000. It should look something like this:
    order1 order2 order3 ... order10000
    1 999 567 ... 1000
    2 976 432 ... 999
    ... ... ... ... ...
    1000 2 254 ... 1
    One way that I can think of to achieve what I want is:
    Code:
    clear all
    set obs 1000

    forvalues i = 1/10000 {
        // one uniform draw per observation, then sort on it to shuffle the rows
        gen u`i' = runiform()
        sort u`i'
        gen order`i' = _n
    }

    But as you can imagine, this is pretty inefficient. Is there any way you know to generate the wanted data more efficiently?

    Thanks,
    CL
    Last edited by ChungLi Wu; 12 Apr 2022, 16:36.

  • #2
    What really slows down your code is the repetitive use of -sort-, because the rest of the loop is pretty fast otherwise. Essentially, you have approached the problem from a "wide" mindset, but you can also structure this from a "long" mindset.

    The benefit of setting up all your data in a long format is that only a few sorts are required, and then you can reshape the data into the desired wide format. The downside here is that the standard -reshape- is known to be slow. For this to work and be competitive, you must use one of the user-contributed alternatives to -reshape-. Here I demonstrate with -greshape- (from package -gtools- on SSC). The example below runs several times faster than your -forvalues- loop did on my machine (~2 min for the loop vs ~20 seconds).

    Code:
    set obs 10000
    gen row = _n
    expand 1000
    bys row : gen id = _n
    gen u = runiform()
    bys id (u) : gen order = _n
    // at this point, if you don't need to keep -u-, simply drop it.
    greshape wide u order, i(row) j(id)



    • #3
      Would making u a tempvar make the code even faster?



      • #4
        Originally posted by Jared Greathouse View Post
        Would making u a tempvar make the code even faster?
        I shouldn’t think so. The variable still needs to be created and held in memory. Its temporary nature requires a little overhead to keep track of naming and dropping it. If u isn’t needed at all then it can simply be dropped once it’s no longer needed, but there’s room here for being more memory efficient.



        • #5
          Which of the other major statistical software providers (including noncommercial, e.g., R, Python, Julia) has not made double precision the default numerical data type?

          .
          . version 17.0

          .
          . clear *

          .
          . local state = c(rngstate)

          .
          . // Begin verbatim from #2 of the thread
          . set obs 10000
          Number of observations (_N) was 0, now 10,000.

          . gen row = _n

          . expand 1000
          (9,990,000 observations created)

          . bys row : gen id = _n

          . gen u = runiform()

          . bys id (u) : gen order = _n

          . // End verbatim from #2
          .
          . generate byte dup = u == u[_n-1] & id == id[_n-1]

          . tabulate dup

                  dup |      Freq.     Percent        Cum.
          ------------+-----------------------------------
                    0 |  9,998,017       99.98       99.98
                    1 |      1,983        0.02      100.00
          ------------+-----------------------------------
                Total | 10,000,000      100.00

          .
          . // Suggestion
          . drop _all

          .
          . set rngstate `state'

          .
          . quietly set obs 10000

          . generate int row = _n

          .
          . quietly expand 1000

          . bysort row: generate int id = _n

          . generate double u = runiform()

          . bysort id (u) : generate int order = _n

          .
          . generate byte dup = u == u[_n-1] & id == id[_n-1]

          . tabulate dup

                  dup |      Freq.     Percent        Cum.
          ------------+-----------------------------------
                    0 | 10,000,000      100.00      100.00
          ------------+-----------------------------------
                Total | 10,000,000      100.00

          .
          . tempname file_handle

          . file open `file_handle' using RNGState.txt, write text

          . file write `file_handle' "`state'"

          . file close `file_handle'

          .
          . exit

          end of do-file


          .


          (Because the pseudorandom-number generator's seed wasn't set in the example code, I ran the do-file from a fresh instantiation of Stata and used the startup seed. It's attached in a text file for reference.)



          • #6
            Originally posted by Joseph Coveney View Post
            Which of the other major statistical software providers (including noncommercial, e.g., R, Python, Julia) has not made double precision the default numerical data type?

            (Because the pseudorandom-number generator's seed wasn't set in the example code, I ran the do-file from a fresh instantiation of Stata and used the startup seed. It's attached in a text file for reference.)
            I’m not sure, though double is my default type for Stata, so I wouldn’t have thought to check for duplicates. Also setting the seed is important though I thought not to mention it since the focus was on efficiency. With your code #5, are you suggesting that duplicates or floating-point precision would be a problem with this random generation process?



            • #7
              I don't know what the OP is up to, and so I don't know whether 2000 unintended ties that result from the use of Stata's default single-precision floating point storage type would be a problem in this particular case. But in general if I want to do anything* (not just sort a column) on the basis of a series of random numbers, then I want uniqueness in the values. That is not attained in this random generation process with Stata's default precision.
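
              As a rough cross-check on the roughly 2000 ties, here is a back-of-envelope estimate. It is only a sketch and rests on assumptions that are not stated anywhere in the thread: that a -float- keeps 24 significant bits, so each interval [2^-k, 2^-(k-1)) holds 2^23 distinct values, each hit with probability of about 2^-(23+k); that the draws are otherwise ideal uniforms; and that a -double- draw takes about 2^53 equally likely values. Triple ties are ignored, so the expected number of tied pairs stands in for the count of adjacent duplicates.

              ```latex
              \begin{align*}
              p_{\text{float}} &\approx \sum_{k=1}^{\infty} 2^{23}\left(2^{-(23+k)}\right)^{2}
                                = \frac{2^{-23}}{3} \approx 3.97\times10^{-8} \\
              \text{pairs} &= 1000\binom{10000}{2} = 4.9995\times10^{10} \\
              E[\text{ties}_{\text{float}}] &\approx 4.9995\times10^{10}\times 3.97\times10^{-8}
                                \approx 1.99\times10^{3} \\
              E[\text{ties}_{\text{double}}] &\approx 4.9995\times10^{10}\times 2^{-53}
                                \approx 5.6\times10^{-6}
              \end{align*}
              ```

              Under those assumptions the single-precision figure is close to the 1,983 duplicates tabulated in #5, and the double-precision figure is consistent with finding none.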
              Last edited by Joseph Coveney; 13 Apr 2022, 01:20. Reason: *Added in edit:anything that requires uniqueness.



              • #8
                Originally posted by Leonardo Guizzetti View Post
                I’m not sure, though double is my default type for Stata, so I wouldn’t have thought to check for duplicates. Also setting the seed is important though I thought not to mention it since the focus was on efficiency.
                By the way, I wasn't intending to be critical of your very helpful suggestions to the OP. And the seed thing, I agree, it's not pertinent to your suggestions; it was only because I wanted to run your code without modification. The seed setting (RNG state) will affect just how many duplicates you'll get, and I wanted to assure reproducibility of the tabulation that I showed.

                My main point is about Stata's choice of default precision and that choice's unintended consequences that can give rise to questions about unexpected behavior, and they do arise rather more frequently than necessary, on the List.

                You've obviously given the default some consideration. Every other statistical software provider that I've checked has done the same.

                It's not a talisman—I've encountered a circumstance in a particularly large dataset years ago where I had to -generate- two columns of double-precision -runiform()- and sort on them both in order to avoid duplicates—but it does seem as if other statistical software providers (even Microsoft Excel uses double precision) are onto something.
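
                The two-column workaround mentioned above might look something like the following. This is a minimal sketch, not the code from that old project: the variable names u1 and u2, the seed, and the number of observations are all placeholders chosen for illustration.

                ```stata
                clear
                set seed 20220413                 // arbitrary seed, for reproducibility only
                set obs 1000                      // placeholder size
                generate double u1 = runiform()   // primary random sort key
                generate double u2 = runiform()   // second key, used only to break ties in u1
                sort u1 u2
                generate int order = _n           // position after the random sort
                ```

                With two independent double-precision keys, a tie would require both u1 and u2 to coincide for the same pair of observations, which is far less likely than a tie in a single key.
                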

                Just adding a simple recommendation
                Code:
                set type double, permanently
                to Stata's installation instructions would seem like it could go a long way toward avoiding precision-related unintended consequences, some of which might not be discovered by the user.



                • #9
                  That makes sense, thanks for clarifying what you were intending to demonstrate. I also agree that if uniqueness is required then it’s always a good idea to check that you have achieved that criterion.
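
                  One way to do that check, sketched here on a scaled-down version of the data from #2 (100 rows by 50 ids rather than 10,000 by 1,000, with an arbitrary seed, both chosen only to keep the example quick), is to test whether -id- and -u- jointly identify the observations before ranking on -u-:

                  ```stata
                  clear
                  set seed 20220413                  // arbitrary seed, for reproducibility only
                  set obs 100                        // scaled down for illustration
                  generate int row = _n
                  expand 50
                  bysort row: generate int id = _n
                  generate double u = runiform()

                  // -isid- exits with an error if (id, u) does not uniquely identify observations
                  capture isid id u
                  if _rc {
                      display as error "ties found in u within id"
                      duplicates report id u         // summarize how many ties there are
                  }
                  else {
                      bysort id (u): generate int order = _n
                  }
                  ```

                  Using -capture- keeps the do-file running so that the tie count can be reported; dropping it would simply stop the run at the first sign of a tie.
                  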

