Choosing random group member observation in the presence of missing values

Nora Bushel

Join Date: Nov 2019

Posts: 19
#1


16 Mar 2020, 09:53

Dear all:

I have a data set that looks roughly like this, i.e. individuals in groups. For most but not all individuals I observe a variable value, e.g. income or blood pressure or whatever helps you imagine.

Code:

// Create data set
clear
set obs 50
egen group = seq(), from(1) to(5)           // 5 groups
gen indiv = _n                              // 10 members each
generate value = round(runiform()*100, 1)   // Integer value between 0 and 100
gen r = runiform()                          // Random variable for sorting and for
replace value = . if (inrange(r, .02, .1) | inrange(r, .8, .9)) // missing values

I now want to assign each group a group-level value which should be a random draw of value from each group:

Code:

// Choose random value as group value
bysort group (r): gen groupvalue = value[1]

But I don't want it to be a missing value! What is the best way to do that? My data set is huge and I will have to run this many times, so I am worried about coming up with a suboptimal solution myself that will cause me pain down the line.

Cheers
Nora

Clyde Schechter

Join Date: Apr 2014
Posts: 29948

16 Mar 2020, 10:13

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte group float(indiv value)
1  1  .
2  2 27
3  3 14
4  4  3
5  5 87
1  6 35
2  7  .
3  8 32
4  9 56
5 10 88
1 11 20
2 12 89
3 13 58
4 14 37
5 15 85
1 16 39
2 17 12
3 18 75
4 19 70
5 20 69
1 21  .
2 22 45
3 23  7
4 24 34
5 25 97
1 26 73
2 27  5
3 28  .
4 29 50
5 30 72
1 31 86
2 32 13
3 33  .
4 34  .
5 35 77
1 36 25
2 37 17
3 38 74
4 39  .
5 40 73
1 41  .
2 42 26
3 43  .
4 44 88
5 45 75
1 46 92
2 47 69
3 48 22
4 49 83
5 50  4
end

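* Flag missing values so that nonmissing observations sort first within each group
gen byte mv = missing(value)

set seed 1234
gen double shuffle = runiform()
* Within each group, sort nonmissing first, then randomly, and take the first value
by group (mv shuffle), sort: gen selected = value[1]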

This will only fail if every observation in a group has a missing value--but then there is no solution to your problem in any case.
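
If it is useful to know in advance whether any group falls into that case, something along these lines may help. It is only a sketch: allmissing is a hypothetical variable name, and it assumes group and value are as in the example above.

```stata
* Sketch: allmissing is 1 only if every value in the group is missing
by group, sort: egen byte allmissing = min(missing(value))

* Number of observations that belong to all-missing groups (0 means no problem)
count if allmissing
```

For such groups, selected would simply remain missing.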

Added: I don't know what you mean when you say your data set is huge. But if the number of observations exceeds maybe 2 or 3 million, then for selecting a random observation, you should generate two double precision uniform random numbers, shuffle1 and shuffle2, and then sort on the pair of them to avoid arbitrary irreproducible selection due to ties.
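
A minimal sketch of that two-variable version, assuming group, value, and mv from the code above are already in memory. The names shuffle1, shuffle2, and selected2 are just illustrative, and the -isid- line is an optional check I have added, not something required for the approach.

```stata
* Sketch only: assumes group, value, and mv from the code above are in memory
set seed 1234
gen double shuffle1 = runiform()
gen double shuffle2 = runiform()

* Optional check: errors out if any (shuffle1, shuffle2) pair is duplicated
isid shuffle1 shuffle2

* Nonmissing values sort first (mv == 0), then random order within group
by group (mv shuffle1 shuffle2), sort: gen selected2 = value[1]
```

Sorting on the pair means a tie would require both random numbers to coincide for two observations, so the sort order, and with it the selection, should be reproducible for a given seed.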

Last edited by Clyde Schechter; 16 Mar 2020, 10:16.


Nora Bushel

Join Date: Nov 2019

Posts: 19
#3

17 Mar 2020, 03:30

Thanks, Clyde, that helped a lot!

NB