Creating N subpopulations by random sampling without replacement

Karel Novak

Join Date: Mar 2018

Posts: 37
#1


25 Dec 2021, 05:53

Dear all, I would like to ask whether it is possible in Stata to create N subpopulations via simple random sampling from the whole population (the population dataset is available).

The restrictions are that (1) every case has to be part of one newly created subpopulation and (2) the subpopulations are mutually disjoint (no case can be part of two or more subpopulations).
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2402
#2

25 Dec 2021, 06:34

If I understand you correctly, you really just want to split the population into randomly divided subgroups, or to create a split sample. This is certainly possible. The only thing you need to determine is the size of each such sample.

There might be a command that does this already; you could search using "split sample" as keywords.

A simple sketch to illustrate the steps follows (not tested, so there may be errors you need to troubleshoot). Here I divide the sample into 5 random subsets. There are potentially many ways to do this, and here I use the modulus function combined with the _n auto-variable which represents the observation number.

Code:

* At the start of your do file, set the seed
set seed 17

* ... Later in your do file
gen randu = rnormal()
sort randu
gen group = mod(_n, 5) + 1
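One detail worth guarding against is tied values in the random sort key, since tied observations are ordered arbitrarily by -sort-. The following is a minimal sketch of one way to check for that, not a tested recipe. It assumes the auto dataset shipped with Stata as a stand-in for the population, and storing the key as a double is my own choice for illustration.

```stata
* Illustration only: the auto dataset stands in for the population
sysuse auto, clear
set seed 17

* Store the sort key in double precision so tied values are very unlikely
gen double randu = rnormal()

* isid exits with an error if randu does not uniquely identify observations
isid randu

* Random order, then deal observations into 5 groups in turn
sort randu
gen group = mod(_n, 5) + 1
```

If -isid- passes, the random order is fully determined by the seed, which keeps the group assignment reproducible.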
Karel Novak

Join Date: Mar 2018

Posts: 37
#3

26 Dec 2021, 11:23

Thank you for your advice. I searched the manuals for the commands you mentioned and found that -rnormal()- creates "standard normal (Gaussian) random variates, that is, variates from a normal distribution with a mean of 0 and a standard deviation of 1". Is this something that could be considered common practice? Might -runiform()- create more equal groups? I am also having a difficult time understanding the purpose of -set seed-. I tried running the commands without it and the results looked promising. Am I missing something? Thanks for your time.
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2402
#4

26 Dec 2021, 15:25

Hi Karel,

The use of any continuous random number generator here is only to randomly sort observations -- the choice of distribution is not important, so long as the random numbers are continuous. Consequently, neither a random normal nor a random uniform will produce more equally sized groups than the other. Related to this is the random number seed. The seed tells the random number generator the starting position to begin from when generating random numbers, and it is what allows the results to be fully reproducible when you re-run your do file at some later point in time. You don't need it, but without it you will get different results every time you run the code, and that would be madness for serious work.
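To make the role of the seed concrete, here is a small sketch, assuming the auto dataset shipped with Stata as a stand-in for your data. The variable names u1, u2 and u3 are made up for illustration. It draws random numbers twice from the same seed and checks that the draws match, then draws again without resetting the seed.

```stata
* Illustration only: the auto dataset stands in for the population
sysuse auto, clear

* First set of draws, starting from seed 17
set seed 17
gen double u1 = runiform()

* Resetting the same seed restarts the generator at the same position
set seed 17
gen double u2 = runiform()

* assert exits with an error if any observation differs
assert u1 == u2

* Without resetting the seed, the generator continues and the draws differ
gen double u3 = runiform()
count if u1 == u3
```

The same logic applies to -rnormal()-: with the seed fixed, the random sort order, and therefore the group assignment, is the same on every run.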

The step that actually makes the groups is the very last line. As I said, you can partition the observations in several ways; this is just one of them. If you inspect the first dozen or so observations after running the method above, you will see that it assigns observations to the groups one at a time, cycling through the N groups in turn (here, N=5), so the group sizes differ by at most one.
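A quick way to see this for yourself is sketched below, again assuming the auto dataset (74 observations) as a stand-in for the population. The second assignment, into contiguous blocks of the random order, is only one possible alternative that I am adding for illustration. The variable name block is made up.

```stata
* Illustration only: the auto dataset stands in for the population
sysuse auto, clear
set seed 17
gen double randu = rnormal()
sort randu
gen group = mod(_n, 5) + 1

* First dozen observations: the groups cycle 2, 3, 4, 5, 1, 2, ...
list randu group in 1/12

* Group sizes differ by at most one
tabulate group

* Another way to partition: contiguous blocks of the random order
gen block = ceil(5 * _n / _N)
tabulate block
```

Both variants place every observation in exactly one group, which satisfies the two restrictions in the original question.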