dropping 1 of 2 randomly-selected duplicate observations

Jillian Emerson

Join Date: Jan 2016

Posts: 12
#1


22 Jan 2016, 14:47

I have a dataset of observations on children: some singletons, and some sets of 2 or 3 siblings. I want to keep only 1 of the siblings for analysis, selected randomly from the 2 or 3 in the dataset. Siblings can be identified by having the same household ID (hhid) but different child IDs (childid). I have identified them based on their duplicate hhid, but I don't want to use the "duplicates drop" command because that will keep the first observation, and I would like to keep a randomly selected one. What is the best way to do this?
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

22 Jan 2016, 14:57

Try this:

Code:

set seed 1234 // OR WHATEVER SEED YOU WISH
gen double shuffle = runiform()
by hhid (shuffle), sort: keep if _n == 1

Notes:
1. To ensure that the process is reproducible, you need to specify the random number seed. It doesn't really matter what number you pick; I have given 1234 as an example.
2. I don't know how large your data set is. If it is really huge, you might need to generate two random numbers, shuffle1 and shuffle2, to avoid having any ties. But unless you are dealing with several million observations (which I think would be surprising for a sibling data set), just one random number will suffice. The reason you don't want any ties is that in the -by hhid (shuffle), sort...- statement, Stata will break those ties in an irreproducible way.
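
For what it's worth, the two-random-number variant from note 2 might look roughly like the sketch below. It reuses hhid from the question and the shuffle1/shuffle2 names from the note. The -isid- check and the final -drop- are additions for illustration only, and the check assumes hhid has no missing values (otherwise -isid- would need its missok option).

```stata
set seed 1234                        // any seed will do; 1234 is just an example
gen double shuffle1 = runiform()
gen double shuffle2 = runiform()     // second draw, used only to break ties in shuffle1

// Optional check (an addition, not part of the original code):
// stops with an error if hhid, shuffle1 and shuffle2 together
// fail to identify each observation uniquely, i.e. if ties remain.
isid hhid shuffle1 shuffle2

// Sort within household by the random draws and keep the first child.
by hhid (shuffle1 shuffle2), sort: keep if _n == 1

drop shuffle1 shuffle2               // the helper variables are no longer needed
```

Singletons are unaffected: a household with one child has only one observation, so that child is always the one kept.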
Jillian Emerson

Join Date: Jan 2016

Posts: 12
#3

23 Jan 2016, 10:13

The dataset is only 1,000 observations and this worked perfectly. Thanks very much!
Meghna Mahambrey

Join Date: Nov 2017

Posts: 5
#4

06 Jan 2020, 10:24

Hello there, I had the same question and this code is very helpful, thank you Clyde. I am unclear what a random number seed is though, would you kindly explain?
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2402
#5

06 Jan 2020, 10:42

Originally posted by Meghna Mahambrey

Hello there, I had the same question and this code is very helpful, thank you Clyde. I am unclear what a random number seed is though, would you kindly explain?

A computer cannot generate truly random numbers. Instead, they are (very well) approximated by deterministic mathematical functions that produce pseudo-random numbers. The "seed" sets the initial value from which that sequence is generated, so the same seed always gives the same sequence. It is viewed as good practice to set the seed once per program so that the results of that code may be replicated in future (for debugging, reproducibility, etc.).

More details can be found by reading the output of -help set seed-.
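
A quick way to see this for yourself is a small sketch like the one below. The seed values 1234 and 5678 are arbitrary examples, and nothing here depends on your data.

```stata
set seed 1234
display runiform()    // first draw after setting the seed

set seed 1234
display runiform()    // same seed again, so this prints the same value as above

set seed 5678
display runiform()    // a different seed starts a different sequence,
                      // so this will generally print a different value
```

This is why the seed is set before -gen double shuffle = runiform()- in the code earlier in the thread: rerunning the do-file on the same data then keeps the same sibling each time.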