  • Choosing random group member observation in the presence of missing values

    Dear all:

    I have a data set that looks roughly like this, i.e. individuals in groups. For most but not all individuals I observe a variable value, e.g. income or blood pressure or whatever helps you imagine.

    Code:
    // Create data set
    clear
    set obs 50
    egen group = seq(), from(1) to(5)         // 5 groups, 10 members each
    gen indiv = _n                            // Individual identifier
    generate value = round(runiform()*100, 1) // Integer value between 0 and 100
    gen r = runiform()                        // Random variable for sorting and for missingness
    replace value = . if (inrange(r, .02, .1) | inrange(r, .8, .9)) // Set roughly 18% of values to missing
    I now want to assign each group a group-level value which should be a random draw of value from each group:

    Code:
    // Choose random value as group value
    bysort group (r): gen groupvalue = value[1]
    But I don't want it to be a missing value! What is the best way to do that? My data set is huge and I will have to run this many times, so I am worried about coming up with a suboptimal solution myself that will cause me pain down the line.

    Cheers
    Nora

  • #2
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte group float(indiv value)
    1  1  .
    2  2 27
    3  3 14
    4  4  3
    5  5 87
    1  6 35
    2  7  .
    3  8 32
    4  9 56
    5 10 88
    1 11 20
    2 12 89
    3 13 58
    4 14 37
    5 15 85
    1 16 39
    2 17 12
    3 18 75
    4 19 70
    5 20 69
    1 21  .
    2 22 45
    3 23  7
    4 24 34
    5 25 97
    1 26 73
    2 27  5
    3 28  .
    4 29 50
    5 30 72
    1 31 86
    2 32 13
    3 33  .
    4 34  .
    5 35 77
    1 36 25
    2 37 17
    3 38 74
    4 39  .
    5 40 73
    1 41  .
    2 42 26
    3 43  .
    4 44 88
    5 45 75
    1 46 92
    2 47 69
    3 48 22
    4 49 83
    5 50  4
    end
    
    gen byte mv = missing(value)
    
    set seed 1234
    gen double shuffle = runiform()
    by group (mv shuffle), sort: gen selected = value[1]
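    Sorting on mv first puts the nonmissing observations at the top of each group, so value[1] is a randomly chosen nonmissing value. This will only fail if every observation in a group has a missing value, but then there is no solution to your problem in any case.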
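The one failure case, a group in which every value is missing, can be detected explicitly rather than discovered later. The following is only a minimal sketch: the toy data set and the flag variable allmiss are hypothetical and not part of the original example.

```stata
* Toy data (hypothetical): group 2 has no nonmissing value at all
clear
input byte group float value
1  .
1 35
1 20
2  .
2  .
3 14
3  .
end

gen byte mv = missing(value)

set seed 1234
gen double shuffle = runiform()

* Nonmissing values sort first within group, in random order
by group (mv shuffle), sort: gen selected = value[1]

* mv[1] equals 1 only when even the first-sorted observation is missing,
* i.e. when the whole group is missing
by group (mv shuffle), sort: gen byte allmiss = mv[1]

* selected should be missing exactly in the all-missing groups
assert missing(selected) == allmiss

* Show which groups could not be assigned a value
tabulate group if allmiss
```

Whether to drop such groups or leave selected missing for them depends on the analysis; the flag just makes the case visible.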

    Added: I don't know what you mean when you say your data set is huge. But if the number of observations exceeds maybe 2 or 3 million, then for selecting a random observation, you should generate two double precision uniform random numbers, shuffle1 and shuffle2, and then sort on the pair of them to avoid arbitrary irreproducible selection due to ties.
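For the large-data case, the two-key shuffle could look roughly like the sketch below. The simulated data are an assumption made only to keep the example self-contained, and the -isid- check is an optional extra, not something the advice above requires: it simply stops with an error if the sort key fails to identify observations uniquely, which is the situation in which ties would make the selection irreproducible.

```stata
* Simulated data (assumption): 5 groups of 10, some values set to missing
clear
set seed 1234
set obs 50
egen group = seq(), from(1) to(5)
gen value = round(runiform()*100, 1)
replace value = . if runiform() < .2

gen byte mv = missing(value)

* Two double-precision keys make ties on the pair extremely unlikely
gen double shuffle1 = runiform()
gen double shuffle2 = runiform()

* Nonmissing first, then random order broken by the second key
by group (mv shuffle1 shuffle2), sort: gen selected = value[1]

* Optional check: error out if the sort key is not unique
isid group mv shuffle1 shuffle2
```

With the seed set before any random numbers are drawn, rerunning the block should reproduce the same selection.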
    Last edited by Clyde Schechter; 16 Mar 2020, 10:16.


    • #3
      Thanks, Clyde, that helped a lot!

      NB
