Splitting the dataset into groups of 799 observations based on a variable

Alessandra Quintigliano

Join Date: Jul 2024

Posts: 3
#1

26 Jul 2024, 05:05

Hi all,

I have a dataset of 250,000 observations and there is a variable called countrynum providing a numeric code for the country.

I need to split the observations into groups of fewer than 800, because I then need to apply a command that only runs on fewer than 800 observations at a time. Whenever a country has fewer than 800 observations, I'm fine using countrynum as the identifier: I temporarily keep if countrynum == i and run the command on that subset. However, certain countries have far more than 800 observations.

I would like to create an identifier that assigns a unique value to each subset of fewer than 800 observations (be it a whole country or a partition of one). That way, I can keep if identifier == i and run the command on each subset separately.

For the whole-country part, I simply copy countrynum into a variable called identifier: gen identifier = 0, then bysort countrynum: egen freq = count(countrynum), and finally replace identifier = countrynum if freq < 800.

For the partition part, I would like to split the observations of each large country into groups of fewer than 800, and assign a unique value in "identifier" to each subset.

Does anyone have ideas?

Thank you very much.
Hemanshu Kumar

Join Date: Mar 2015

Posts: 1127
#2

26 Jul 2024, 05:20

Code:

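bysort countrynum: gen chunk = ceil(_n/799)   // blocks of at most 799 observations within each country
egen identifier = group(countrynum chunk)      // one unique ID per country-block pair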
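
For what it's worth, here is a minimal sketch of how the whole workflow might look once every block has its own ID, assuming a do-file context. The names chunk and block_id are hypothetical, and mycommand is only a placeholder for the command that is limited to fewer than 800 observations. Countries with fewer than 800 observations end up as a single block, so no separate whole-country step should be needed.

```stata
* Sketch only: chunk and block_id are illustrative names, not from the original post.
* Number the observations within each country and cut them into blocks of at most 799.
bysort countrynum: gen long chunk = ceil(_n/799)

* Turn each (country, chunk) pair into one ID running from 1 to the number of blocks.
egen long block_id = group(countrynum chunk)

* Sanity check: stop with an error if any block has 800 or more observations.
bysort block_id: assert _N < 800

* Find the number of blocks and loop over them.
summarize block_id, meanonly
local nblocks = r(max)

forvalues i = 1/`nblocks' {
    preserve
    keep if block_id == `i'
    * mycommand ...   (placeholder: run the size-limited command here)
    restore
}
```

Note that the order of observations within a country is whatever the sort leaves it as, so which rows land in which block is arbitrary unless a second sort variable is added inside bysort.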