  • dropping 1 of 2 randomly-selected duplicate observations

    I have a dataset of observations on children: some singletons, and some sets of 2 or 3 siblings. I want to keep only 1 sibling from each set for analysis, selected randomly from the 2 or 3 in the dataset. Siblings share the same household ID (hhid) but have different child IDs (childid). I have identified them by their duplicate hhid, but I don't want to use the "duplicates drop" command because that will keep the first observation, and I would like to keep a randomly selected one. What is the best way to do this?

  • #2
    Try this:

    Code:
    set seed 1234 // OR WHATEVER SEED YOU WISH
    gen double shuffle = runiform()
    by hhid (shuffle), sort: keep if _n == 1
    Notes:
    1. To ensure that the process is reproducible, you need to set the random number seed. It doesn't really matter what number you pick; I have given 1234 as an example.
    2. I don't know how large your data set is. If it is really huge, you might need to generate two random numbers, shuffle1 and shuffle2, and sort on both to avoid ties. But unless you are dealing with several million observations (which would be surprising for a sibling data set), one random number will suffice. The reason you don't want ties is that in the -by hhid (shuffle), sort...- statement, Stata breaks ties in an irreproducible way.
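
    For what it's worth, the two-random-number variant from note 2 might look something like the sketch below. It assumes the same hhid variable as in the original question; shuffle1 and shuffle2 are just the names suggested in the note, and the -isid- check is an extra safeguard I have added, not something the approach requires.

    ```stata
    set seed 1234                       // any fixed seed, as in the code above
    gen double shuffle1 = runiform()
    gen double shuffle2 = runiform()    // second key, used only to break ties in shuffle1

    // Optional check (an addition for illustration): stops with an error
    // if two children in the same household share both random numbers.
    // Note that -isid- also complains about missing values in hhid.
    isid hhid shuffle1 shuffle2

    // Within each household, sort by the random keys and keep the first child.
    by hhid (shuffle1 shuffle2), sort: keep if _n == 1
    drop shuffle1 shuffle2
    ```

    If -isid- passes, the sort order within each household is fully determined by the seed, so rerunning the do-file should keep the same children each time.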


    • #3
      The dataset is only 1,000 observations, and this worked perfectly. Thanks very much!


      • #4
        Hello there, I had the same question and this code is very helpful, thank you Clyde. I am unclear on what a random number seed is, though. Would you kindly explain?


        • #5
          Originally posted by Meghna Mahambrey
          Hello there, I had the same question and this code is very helpful, thank you Clyde. I am unclear on what a random number seed is, though. Would you kindly explain?
          A computer cannot generate truly random numbers. Instead, it produces pseudo-random numbers: sequences computed by deterministic mathematical functions that approximate randomness (very well). The "seed" sets the initial value from which that sequence is generated, so the same seed always yields the same sequence. It is considered good practice to set the seed once per program so that the results of the code can be replicated in the future (for debugging, reproducibility, etc.).

          More details can be found by reading the output of -help set seed-.
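
          As a small illustration of what the seed does, something like the following could be run in a scratch session. The variable names u1, u2 and u3 and the seed values are arbitrary choices made for this example, and -clear- discards whatever data is in memory, so don't run it on top of unsaved work.

          ```stata
          clear                       // WARNING: drops the data currently in memory
          set obs 5

          set seed 1234
          gen double u1 = runiform()

          set seed 1234               // reset to the same seed
          gen double u2 = runiform()

          assert u1 == u2             // same seed, so the same sequence of draws

          set seed 5678               // a different seed starts a different sequence
          gen double u3 = runiform()

          list u1 u2 u3
          ```

          The -assert- passes silently because resetting the seed restarts the generator from the same state, which is exactly why setting it makes the random selection of siblings repeatable.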
