Random sampling according to group in Stata

Meng JI

Join Date: May 2021

Posts: 77
#1


31 Mar 2022, 20:53

Hi everyone,

I have a question about how to randomly sample data in Stata by group. My data structure is shown in the example at the end of this post.

I want to randomly select 1/3 of the ids and keep every observation that belongs to each selected id. For instance, if id 3 is chosen, I keep all the observations of id 3 (4 observations in my example).
Similarly, if id 1 is chosen, I keep all the observations of id 1.

Does anyone know how to do this in Stata?

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id str9 date
1 "2022/1/2"
1 "2022/1/8"
1 "2022/1/9"
1 "2022/1/12"
1 "2022/1/13"
1 "2022/1/14"
2 "2022/1/3"
2 "2022/1/6"
2 "2022/2/1"
2 "2022/2/8"
2 "2022/2/11"
3 "2022/2/6"
3 "2022/2/9"
3 "2022/2/13"
3 "2022/2/15"
end

Thanks a lot!

Lizzy Padhi

Join Date: Mar 2022
Posts: 5

31 Mar 2022, 21:20

If you're fine with each ID having an equal chance of getting selected, you could reshape to wide and then sample.

Code:

clear
input byte id str9 date
1 "2022/1/2" 
1 "2022/1/8" 
1 "2022/1/9" 
1 "2022/1/12"
1 "2022/1/13"
1 "2022/1/14"
2 "2022/1/3" 
2 "2022/1/6" 
2 "2022/2/1" 
2 "2022/2/8" 
2 "2022/2/11"
3 "2022/2/6" 
3 "2022/2/9" 
3 "2022/2/13"
3 "2022/2/15"
end


bysort id: gen index = _n
reshape wide date, i(id) j(index)
sample 2, count 
reshape long
drop index
drop if date==""
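The code above keeps 2 of the 3 ids. If the goal is one third of the ids, as in the original question, the count can be computed from the number of rows in the wide layout. A minimal sketch under that assumption; the seed value and the local macro name `k` are illustrative, not from the thread:

```stata
clear
input byte id str9 date
1 "2022/1/2"
1 "2022/1/8"
1 "2022/1/9"
1 "2022/1/12"
1 "2022/1/13"
1 "2022/1/14"
2 "2022/1/3"
2 "2022/1/6"
2 "2022/2/1"
2 "2022/2/8"
2 "2022/2/11"
3 "2022/2/6"
3 "2022/2/9"
3 "2022/2/13"
3 "2022/2/15"
end

set seed 12345                       // illustrative seed, any value works
bysort id: gen index = _n            // running counter within id
reshape wide date, i(id) j(index)    // one row per id
local k = ceil(_N/3)                 // one third of the ids, rounded up
sample `k', count                    // keep exactly k ids
reshape long                         // back to one row per id-date slot
drop index
drop if date == ""                   // remove empty slots created by reshape
```

With the three ids in the example, `k` is 1, so all observations of a single randomly chosen id remain.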


Meng JI

Join Date: May 2021
Posts: 77

31 Mar 2022, 21:23

Hi Lizzy,

Thank you so much for your help. The code works perfectly for me.
Cheers!


Meng JI

Join Date: May 2021

Posts: 77
#4

31 Mar 2022, 21:46

Hi Lizzy,

Sorry, I have an additional question. The data I showed you is just a sample of my much larger dataset. In my real data there are around 100 other variables, such as quantity and price, that are associated with each ID and can vary over time.

When I applied the code to my real data, I encountered an error. Do you know how to fix this issue? The goal is still to keep all the observations of the randomly selected IDs.

Thanks a lot!
Mike Lacy

Join Date: Apr 2014

Posts: 2404
#5

01 Apr 2022, 08:26

I would choose to do this without reshaping. The general strategy is to temporarily reduce the file to one observation per id, select a sample and create a marker variable, save that sample to a file, and merge back onto the original file.

Code:

set seed 47642  // whatever you like
preserve
bysort id: keep if _n == 1  // one observation per id
sample 33  // percent
gen byte insample = 1  // a variable to mark the chosen IDs
keep id insample
tempfile temp
save `temp'
restore
//
merge m:1 id using `temp'
list, nolabel  // see what this did
keep if insample == 1
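The same idea (one random decision per id, applied to all of that id's observations) can also be sketched without a temporary file or a merge. This is only an illustrative variant, not the method posted above: it draws one uniform number per id, copies it to every observation of that id, ranks the ids by that number, and keeps the lowest third. The variable names `u` and `idrank` are made up for the example.

```stata
clear
input byte id str9 date
1 "2022/1/2"
1 "2022/1/8"
1 "2022/1/9"
1 "2022/1/12"
1 "2022/1/13"
1 "2022/1/14"
2 "2022/1/3"
2 "2022/1/6"
2 "2022/2/1"
2 "2022/2/8"
2 "2022/2/11"
3 "2022/2/6"
3 "2022/2/9"
3 "2022/2/13"
3 "2022/2/15"
end

set seed 47642                                     // whatever you like
bysort id: gen double u = runiform() if _n == 1    // one random draw per id
bysort id (u): replace u = u[1]                    // missing sorts last, so u[1] is the draw
egen long idrank = group(u)                        // ids numbered 1..K in order of their draw
quietly summarize idrank, meanonly                 // r(max) is the number of ids
keep if idrank <= ceil(r(max)/3)                   // keep one third of the ids, rounded up
drop u idrank
```

Because nothing is reshaped, any other variables in the data (quantity, price, and so on) simply ride along with the observations that are kept.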
Meng JI

Join Date: May 2021

Posts: 77
#6

04 Apr 2022, 19:38

Hi Mike,

The code works perfectly for me. Thank you so much for your help!

Best
Meng