Generating a random sample from two subgroups

Rena Nakyeyune

Join Date: Aug 2023

Posts: 2
#1

29 Aug 2023, 07:13

Hello, I am somewhat new to Stata and I am trying to generate a random sample. Kindly assist.

My data is composed of 4 villages. Each village is composed of an undefined number of households. Each household is composed of an undefined number of children.
I need to create a random sample of 70 children from each village, with no two children coming from the same household.

I have read about the sample command, but I cannot figure out how to make sure that no two selected children come from the same household.

sort village
by village: sample 70, count
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#2

29 Aug 2023, 09:39

In general, it is difficult to provide help with code without having example data to work with. That is especially the case in situations like yours where the solution is unlikely to reduce to just one or two commands. Please post back using the -dataex- command to show example data. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When asking for help with code, always show example data. When showing example data, always use -dataex-.
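For readers who have not used -dataex- before, a minimal sketch of a typical call follows. It uses the auto dataset that ships with Stata purely as a stand-in; the variable list and the range of observations are arbitrary choices for illustration, not anything from the original poster's data.

```stata
* Stand-in example: the auto dataset shipped with Stata
sysuse auto, clear

* If -dataex- is not part of your installation, first run: ssc install dataex

* Show the first 10 observations of a few variables in a form that
* others can paste straight into their own Stata session
dataex make price mpg foreign in 1/10
```

The output is a block of -input- code that carries storage types, display formats, and value labels along with the values, which is the information a table or screenshot loses.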
Felix Bittmann

Join Date: Aug 2018

Posts: 662
#3

29 Aug 2023, 10:21

As Clyde pointed out, this is difficult without example data. I have written an example for the simplest case. Note that it requires, of course, that the data contain at least 70 households in each village.

Code:

set seed 123
gen random = runiform()
bysort village household (random): gen ID = _n
keep if ID == 1
sample 70, count by(village)

The idea is as follows. We create a random number for each child and sort the data by village, household, and random number. Then we keep the first child in each household. Because the order within each household is determined by the random number, this picks one child per household at random. From the resulting pool of children we select 70 in each village.
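To try the approach without access to the real data, the steps can be run on a made-up data set. The sketch below is only an illustration: the sizes (4 villages, 100 households each, 1 to 4 children per household) and the helper variables nchildren, child, hh_tag, and n_hh are assumptions, not part of the original question. It also checks the requirement of at least 70 households per village before sampling.

```stata
* Toy data (all names and sizes are invented for illustration)
clear
set seed 123
set obs 400
gen village = ceil(_n / 100)              // 4 villages, 100 households each
gen household = _n                        // household IDs, unique overall
gen nchildren = ceil(4 * runiform())      // 1 to 4 children per household
expand nchildren                          // one observation per child
bysort household: gen child = _n

* Check that every village has at least 70 households
egen hh_tag = tag(village household)
egen n_hh = total(hh_tag), by(village)
assert n_hh >= 70

* Keep one randomly chosen child per household, then 70 per village
gen random = runiform()
bysort village household (random): gen ID = _n
keep if ID == 1
sample 70, count by(village)

* Verify: no household appears twice, and each village has 70 children
isid household
tabulate village
```

In real data where household codes repeat across villages, the final check would be -isid village household- instead.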

Best wishes

(Stata 16.1 MP)
Rena Nakyeyune

Join Date: Aug 2023

Posts: 2
#4

01 Sep 2023, 03:59

Thank you Clyde and Felix. Due to some restrictions, I'm not able to share the data but the shared code has solved my issue. Thanks.
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#5

01 Sep 2023, 09:45

I'm glad Felix's code has solved your issue.

I just want to make the point that access restrictions on your data do not preclude making use of -dataex-. For nearly all applications, the reason for requesting example data with -dataex- is to see the details of how the data is laid out and structured. The important issues are things like: is the layout long, wide, or some hybrid? Which variables are strings, and which are numeric? What are the data storage types of the variables? Which variables are nested within which (if any)? Things like that. The actual values of the variables are usually not important.

So, for others who are following this thread: if you find yourself in a situation where you need help with code but are working with data that cannot be shared, you can make up a data set that looks just like the real thing, but with fake values, and then run -dataex- on that. You can do this by -use-ing the real data set and then writing some brief code that replaces all the variables with other (possibly random) values, just making sure that observations with the same value of a given variable get the same fake value for that variable.
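One way such masking code might look is sketched below. Everything in it is hypothetical: the small -input- block stands in for the real, restricted data set (which you would load with -use- instead), and the variable names, values, and the helper variables v and h are invented. The sketch keeps string variables as strings and gives equal real values equal fake values.

```stata
* Stand-in for the real, restricted data (all values invented)
clear
input str10 village str6 household byte age
"Kasana" "H-017" 4
"Kasana" "H-017" 7
"Kasana" "H-052" 3
"Bukoto" "H-017" 9
"Bukoto" "H-230" 2
end

* Map each distinct real value to an arbitrary code, one variable at a time,
* so observations sharing a real value also share the fake value
egen long v = group(village)
egen long h = group(household)

* Overwrite the identifiers, keeping them as string variables
replace village = "V" + string(v)
replace household = "H" + string(h)
drop v h

* Replace a measured value with a random one of the same storage type
set seed 2023
replace age = runiformint(0, 12)

* Share the masked example
dataex
```

Because -egen, group()- numbers the values in sorted order, the fake codes still reflect the alphabetical order of the originals; if even that is sensitive, the codes could be shuffled further before posting.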
