  • Generating a random normal variable with a fixed number of unique values

    Dear all,
    I'm using a data set with 5,000,000 observations, and I'd like to generate a normally distributed random variable, random, that has approximately 2,000,000 unique values.
    So far, I've managed to generate a uniformly distributed variable random with 1,835,364 unique values using:

    Code:
    clear
    set seed 2052018
    set obs 5000000
    gen double u = ((1999999)*runiform() + 1) * ((2^32-1)/2^32)
    gen random = round(u)
    Unfortunately, I can't seem to figure out how to obtain a normally distributed random that has the same number of unique values. Does anyone know a way to code this?

    Best
    Christian
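A side note on the 1,835,364 figure: it is close to what independent draws would be expected to give, since some integers are hit several times and others never. The following is a back-of-the-envelope sketch, assuming the 5,000,000 draws fall approximately uniformly and independently on 2,000,000 integers (it ignores the half-width end bins that round() creates). The scalar names m, n, and expected are illustrative and do not appear in the original post.

```stata
* Expected number of distinct integers hit when n draws fall uniformly on m values
* (assumption: independent, equally likely values; end-bin effects ignored)
scalar m = 2000000
scalar n = 5000000
* each value is missed with probability (1 - 1/m)^n
scalar expected = m * (1 - (1 - 1/m)^n)
display %12.0fc scalar(expected)
```

Under these assumptions the result is about 1.84 million, so falling short of 2,000,000 distinct values is a consequence of random collisions rather than of the scaling in the generate statement.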

  • #2
    A distribution can't be normal and discrete. I don't know why you want this, but the corresponding binomial is likely to be a better starting point. For example,

    Code:
    . set seed 2803
    
    . set obs 100000
    number of observations (_N) was 0, now 100,000
    
    . gen y = rbinomial(11, 0.5)
    
    . tab y
    
              y |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |         43        0.04        0.04
              1 |        554        0.55        0.60
              2 |      2,705        2.71        3.30
              3 |      8,058        8.06       11.36
              4 |     16,171       16.17       27.53
              5 |     22,673       22.67       50.20
              6 |     22,598       22.60       72.80
              7 |     15,822       15.82       88.62
              8 |      8,158        8.16       96.78
              9 |      2,655        2.65       99.44
             10 |        514        0.51       99.95
             11 |         49        0.05      100.00
    ------------+-----------------------------------
          Total |    100,000      100.00
    As always, distinct is a better word than unique when that is what you mean. Unique still carries the primary sense of occurring just once. More at https://www.stata-journal.com/sjpdf....iclenum=dm0042

    That said, I think you'll be hard pushed to get 2 million distinct values AND an approximately normal distribution without a much larger sample size than the one you're playing with.
    Last edited by Nick Cox; 02 May 2018, 08:20.



    • #3
      I agree entirely with Nick: I cannot imagine the purpose of generating a continuous random variable and then constraining it to a limited number of distinct values. That said, another approach is to generate a full set of random values and then coarsen them. The following code, run on my system, produces the coarsened variable y with 2,000,000 distinct values, and the summary statistics for x and y show that their distributions are essentially identical.
      Code:
      clear
      set obs 5000000
      set seed 2052018
      generate double x = rnormal()
      sort x
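      * every 5 sorted observations form 2 groups (sizes 3 and 2), giving 2,000,000 groups
      generate long group = floor((_n-1)*2/5)+1
      * y is the middle value in a group of 3 and the mean of the pair in a group of 2
      bysort group (x): generate double y = cond(_N==3,x[2],(x[1]+x[2])/2)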
      codebook x y, compact
      summarize x y, detail
      Code:
      . codebook x y, compact
      
      Variable       Obs  Unique       Mean        Min       Max  Label
      ------------------------------------------------------------------------------------------------
      x          5000000 5000000  -.0003158  -4.724853  5.376398  
      y          5000000 2000000  -.0003159  -4.719964  5.176187  
      ------------------------------------------------------------------------------------------------
      
      . summarize x y, detail
      
                                    x
      -------------------------------------------------------------
            Percentiles      Smallest
       1%    -2.326908      -4.724853
       5%    -1.645423      -4.719964
      10%    -1.282543      -4.695449       Obs           5,000,000
      25%    -.6749557      -4.650745       Sum of Wgt.   5,000,000
      
      50%    -.0002361                      Mean          -.0003158
                              Largest       Std. Dev.      1.000163
      75%     .6742286       4.952557
      90%     1.282528       4.969922       Variance       1.000326
      95%     1.644754       4.975977       Skewness       .0003641
      99%     2.325678       5.376398       Kurtosis       2.998477
      
                                    y
      -------------------------------------------------------------
            Percentiles      Smallest
       1%    -2.326908      -4.719964
       5%    -1.645423      -4.719964
      10%    -1.282542      -4.719964       Obs           5,000,000
      25%    -.6749554      -4.645181       Sum of Wgt.   5,000,000
      
      50%    -.0002357                      Mean          -.0003159
                              Largest       Std. Dev.      1.000163
      75%     .6742288       4.952557
      90%     1.282527       4.952557       Variance       1.000326
      95%     1.644754       5.176187       Skewness        .000362
      99%     2.325678       5.176187       Kurtosis       2.998475
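
To see how the grouping rule behaves, here is a minimal sketch on 10 observations instead of 5,000,000. The variable size is a hypothetical helper that is not in the code above; it records how many observations fall in each group, which should alternate 3, 2, 3, 2.

```stata
clear
set seed 2052018
set obs 10                       // small stand-in for the 5,000,000 observations
generate double x = rnormal()
sort x
* same rule as above: every 5 sorted observations are split into 2 groups
generate long group = floor((_n-1)*2/5) + 1
* size is an illustrative helper: number of observations in each group
bysort group (x): generate byte size = _N
* middle value for a group of 3, mean of the pair for a group of 2
bysort group (x): generate double y = cond(_N==3, x[2], (x[1]+x[2])/2)
list group size x y, sepby(group)
```

Because each group contributes a single value of y, 10 observations yield 4 groups, in the same 5-to-2 ratio that turns 5,000,000 observations into 2,000,000 distinct values.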
