How to sample panel data by ID while keeping the length of Panel?

Alex Weng

Join Date: Jul 2017

Posts: 18
#1

13 Jun 2022, 15:58

Hi All,

I have a large balanced panel dataset, and I want to sample it by ID while keeping the panel structure of the data.
I know I should use the "sample" command, but I am not sure how I can keep the panel structure intact.

I want to sample according to individuals in the dataset, and if the person is sampled, I want to keep all their data across the different periods.

I can only think of a cumbersome way: drop duplicates by idcode first, then sample the unique idcodes, save that file, and later merge it with the original dataset, keeping the matched observations. This could take very long for my large dataset, so I am wondering if any of you have a better idea.

Thank you so much for your help,
Alex

For example, I want to sample 5% according to the idcode, but if the idcode is sampled, I would like to retain all its periods' data.
webuse nlswork, clear
sort idcode year
list in 1/50

idcode year

1. 1 70
2. 1 71
3. 1 72
4. 1 73
5. 1 75

6. 1 77
7. 1 78
8. 1 80
9. 1 83
10. 1 85

11. 1 87
12. 1 88
13. 2 71
14. 2 72
15. 2 73

16. 2 75
17. 2 77
18. 2 78
19. 2 80
20. 2 82

21. 2 83
22. 2 85
23. 2 87
24. 2 88
25. 3 68

26. 3 69
27. 3 70
28. 3 71
29. 3 72
30. 3 73

31. 3 75
32. 3 77
33. 3 78
34. 3 80
35. 3 82

36. 3 83
37. 3 85
38. 3 87
39. 3 88
40. 4 70

41. 4 71
42. 4 72
43. 4 73
44. 4 75
45. 4 80
Clyde Schechter

Join Date: Apr 2014

Posts: 29959
#2

13 Jun 2022, 16:51

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte(idcode year)
1 70
1 71
1 72
1 73
1 75
1 77
1 78
1 80
1 83
1 85
1 87
1 88
2 71
2 72
2 73
2 75
2 77
2 78
2 80
2 82
2 83
2 85
2 87
2 88
3 68
3 69
3 70
3 71
3 72
3 73
3 75
3 77
3 78
3 80
3 82
3 83
3 85
3 87
3 88
4 70
4 71
4 72
4 73
4 75
4 80
end

set seed 1234 // OR ANY OTHER INTEGER YOU LIKE
gen double shuffle = runiform()
by idcode (year), sort: gen byte in_sample = shuffle[1] < 0.05

This will designate all of the observations in an idcode as 1 or 0 for in_sample. To use your 5% sample, whatever you want to do with it, just tack -if in_sample- onto each command.
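
As a quick sanity check on the flag, here is a minimal sketch. It assumes the nlswork data from the original question rather than the short -dataex- extract; first_of_id is a helper variable introduced only for this illustration. Because each idcode is selected independently with probability 0.05, the share of sampled IDs should come out close to, but not exactly, 5%.

```stata
webuse nlswork, clear
set seed 1234
gen double shuffle = runiform()
by idcode (year), sort: gen byte in_sample = shuffle[1] < 0.05

* The flag should be constant within each idcode; -assert- stops if not.
by idcode (in_sample), sort: assert in_sample[1] == in_sample[_N]

* Share of IDs selected: the mean of in_sample over one row per idcode.
egen byte first_of_id = tag(idcode)   // hypothetical helper name
summarize in_sample if first_of_id

* An analysis restricted to the sampled IDs.
summarize ln_wage if in_sample
```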

That said, if your data set is truly large, do not underestimate the amount of time it will take for Stata to process the -if in_sample- clause for every observation at every command. If the operations to be performed on the sample are numerous, with many lines of code, it could conceivably prove slower than the keep, save, and merge approach you are trying to improve on.
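
If the repeated -if in_sample- qualifier does turn out to be the bottleneck, one middle path is to reduce the data to the sample once and run everything on the smaller dataset. The sketch below assumes the nlswork data and a single sample; -xtset- and -xtdescribe- merely stand in for whatever analysis is actually intended.

```stata
webuse nlswork, clear
set seed 1234
gen double shuffle = runiform()
by idcode (year), sort: gen byte in_sample = shuffle[1] < 0.05

preserve                      // keep a copy of the full data
keep if in_sample             // reduce to the sampled IDs once
drop shuffle in_sample

* From here on, no -if- qualifier is needed.
xtset idcode year
xtdescribe

restore                       // bring the full data back
```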

If you are going to be repeatedly sampling, and if you are using version 16 or later, another approach would be:

Code:

use dataset, clear
set seed 1234
local n_reps 1000 // PROCESS 1000 5% RANDOM SAMPLES
preserve
forvalues i = 1/`n_reps' {
    gen double shuffle = runiform()
    by idcode (year), sort: keep if shuffle[1] < 0.05
    // CODE TO WORK WITH THE SAMPLE GOES HERE
    restore, preserve
}

The use of -preserve- and -restore, preserve- assures that the data set will only have to be written out once: it is retained as still -preserve-d with each -restore-. So instead of 1000 read/writes you only have 1000 reads. Better still, if you have adequate active memory available, with version 16 or later, the -preserve-d file is written to RAM, not to disk, so it is pretty fast to initially -preserve- it and to -restore- it in each iteration of the loop.
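
To make the loop concrete, here is a small runnable instance of the same pattern. It assumes the nlswork data, only 10 replications, and a mean of ln_wage as a stand-in for the real per-sample work; none of these choices come from the original question.

```stata
webuse nlswork, clear
set seed 1234
local n_reps 10               // small number, for illustration only

preserve
forvalues i = 1/`n_reps' {
    gen double shuffle = runiform()
    by idcode (year), sort: keep if shuffle[1] < 0.05

    * Placeholder for the real work on each sample.
    quietly summarize ln_wage
    display "rep `i': N = " r(N) ", mean ln_wage = " %6.4f r(mean)

    restore, preserve         // full data back, still preserved
}
restore                       // final restore releases the preserved copy
```

Since shuffle is created after -preserve-, it disappears at each -restore, preserve-, so the -gen- at the top of the loop does not collide with an existing variable.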

Added: In the future, when showing data examples, please use the -dataex- command to do so, as I have here. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Fei Wang

Join Date: Oct 2021

Posts: 726
#3

13 Jun 2022, 19:31

In addition to Clyde's solution, you can still use -sample- to randomly draw panel IDs. The idea is to first tag the first observation of each ID, then randomly draw 5% of the tagged observations (untagged observations are left in the dataset), and finally keep all observations of the sampled IDs.

Code:

webuse nlswork, clear
sort idcode year
egen tag = tag(idcode)
sample 5 if tag
bys idcode (year): keep if tag[1]
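
A slightly more defensive variant of the same idea is sketched below. The -egen- help describes tag() as tagging just one observation per group and, as far as it goes, does not say which one. The last line therefore sorts on tag within idcode instead of assuming the tagged row comes first by year. The seed is an assumption added for reproducibility.

```stata
webuse nlswork, clear
set seed 1234
egen byte tag = tag(idcode)      // exactly one tagged row per idcode
sample 5 if tag                  // keep 5% of tagged rows; untagged rows stay

* Within each idcode, sort the tagged row (if it survived) to the end,
* then keep the whole ID only if that last row is tagged.
bysort idcode (tag): keep if tag[_N]
drop tag
```

Unlike the runiform() approach, which selects each ID independently with probability 0.05, -sample 5- should retain a fixed 5% share of the tagged rows, so the number of sampled IDs does not vary from draw to draw.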

Last edited by Fei Wang; 13 Jun 2022, 19:33.
Alex Weng

Join Date: Jul 2017

Posts: 18
#4

14 Jun 2022, 07:58

Thank you Clyde and Fei!