
  • How to sample panel data by ID while keeping the length of Panel?

    Hi All,

    I have a large balanced panel dataset, and I want to sample it by ID while keeping its panel structure.
    I know I should use the "sample" command, but I am not sure how to keep the panel structure of the data.

    I want to sample individuals, and if a person is sampled, I want to keep all of their data across the different periods.

    I can only think of a cumbersome way: drop duplicates by idcode first, then sample the unique idcodes, then save that list and merge it back with the original dataset, keeping only the matches. This could take a very long time for my large dataset, so I am wondering if any of you have a better idea.

    Thank you so much for your help,
    Alex


    For example, I want to sample 5% of the idcodes, and if an idcode is sampled, I would like to retain its data for all periods.
    webuse nlswork, clear
    sort idcode year
    list in 1/50

    idcode year

    1. 1 70
    2. 1 71
    3. 1 72
    4. 1 73
    5. 1 75

    6. 1 77
    7. 1 78
    8. 1 80
    9. 1 83
    10. 1 85

    11. 1 87
    12. 1 88
    13. 2 71
    14. 2 72
    15. 2 73

    16. 2 75
    17. 2 77
    18. 2 78
    19. 2 80
    20. 2 82

    21. 2 83
    22. 2 85
    23. 2 87
    24. 2 88
    25. 3 68

    26. 3 69
    27. 3 70
    28. 3 71
    29. 3 72
    30. 3 73

    31. 3 75
    32. 3 77
    33. 3 78
    34. 3 80
    35. 3 82

    36. 3 83
    37. 3 85
    38. 3 87
    39. 3 88
    40. 4 70

    41. 4 71
    42. 4 72
    43. 4 73
    44. 4 75
    45. 4 80



  • #2
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte(idcode year)
    1 70
    1 71
    1 72
    1 73
    1 75
    1 77
    1 78
    1 80
    1 83
    1 85
    1 87
    1 88
    2 71
    2 72
    2 73
    2 75
    2 77
    2 78
    2 80
    2 82
    2 83
    2 85
    2 87
    2 88
    3 68
    3 69
    3 70
    3 71
    3 72
    3 73
    3 75
    3 77
    3 78
    3 80
    3 82
    3 83
    3 85
    3 87
    3 88
    4 70
    4 71
    4 72
    4 73
    4 75
    4 80
    end
    
    set seed 1234 // OR ANY OTHER INTEGER YOU LIKE
    gen double shuffle = runiform()
    by idcode (year), sort: gen byte in_sample = shuffle[1] < 0.05
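    This will mark every observation of an idcode as 1 or 0 for in_sample. Because each idcode is selected independently with probability 0.05, the result is approximately, not exactly, 5% of the idcodes. To use your 5% sample, whatever you want to do with it, tack -if in_sample- onto each command.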
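    As a quick check of what the flag does, here is a minimal sketch. It assumes the -nlswork- data from #1 rather than the short -dataex- excerpt (with only four idcodes, a 5% draw would usually select nobody). The names id_tag and ln_wage are illustrative: id_tag is a helper created here, and ln_wage is simply a variable that happens to exist in -nlswork-.

    ```stata
    webuse nlswork, clear
    set seed 1234
    gen double shuffle = runiform()
    by idcode (year), sort: gen byte in_sample = shuffle[1] < 0.05

    * Hypothetical helper: flags exactly one observation per idcode
    egen byte id_tag = tag(idcode)

    count if id_tag                  // number of distinct idcodes
    count if id_tag & in_sample      // number of sampled idcodes (roughly 5%)

    * Any analysis command can be restricted to the sample this way
    summarize ln_wage if in_sample
    ```

    The second count will be close to, but generally not exactly, 5% of the first.
    
    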

    That said, if your data set is truly large, do not underestimate the amount of time it will take for Stata to process the -if in_sample- clause for every observation at every command. If the operations to be performed on the sample are numerous, with many lines of code, it could conceivably prove slower than the keep, save, and merge approach you are trying to improve on.
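
    For comparison, the keep, save, and merge approach described in #1 might look like the sketch below. It is an illustration on -nlswork-, not a benchmark; the tempfile name sampled_ids is made up for this example.

    ```stata
    webuse nlswork, clear
    set seed 1234

    * Build a file holding a 5% random draw of the distinct idcodes
    preserve
    keep idcode
    duplicates drop
    sample 5
    tempfile sampled_ids
    save `sampled_ids'
    restore

    * Keep every observation belonging to a sampled idcode
    merge m:1 idcode using `sampled_ids', keep(match) nogenerate
    sort idcode year
    ```

    Unlike the -runiform()- threshold, -sample 5- here draws 5% of the distinct idcodes, up to rounding, and the resulting dataset needs no -if- clause afterwards.
    
    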

    If you are going to be repeatedly sampling, and if you are using version 16 or later, another approach would be:

    Code:
    use dataset, clear
    set seed 1234
    
    local n_reps 1000 // PROCESS 1000 5% RANDOM SAMPLES
    
    preserve
    forvalues i = 1/`n_reps' {
        gen double shuffle = runiform()
        by idcode (year), sort: keep if shuffle[1] < 0.05
        // CODE TO WORK WITH THE SAMPLE GOES HERE
        restore, preserve
    }
    The use of -preserve- and -restore, preserve- ensures that the data set only has to be written out once: it remains -preserve-d after each -restore-. So instead of 1000 reads and 1000 writes, you have one write and 1000 reads. Better still, if you have adequate memory available, with version 16 or later the -preserve-d file is written to RAM, not to disk, so both the initial -preserve- and the -restore- in each iteration of the loop are pretty fast.
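
    To illustrate what might go in place of the placeholder comment, here is a hedged sketch that stores one statistic per replicate with -postfile-. Everything specific in it is an assumption made for the example: the -nlswork- data, 20 replicates instead of 1000, the mean of ln_wage as the statistic, and the names results, resfile, rep, and mean_lnwage.

    ```stata
    webuse nlswork, clear
    set seed 1234

    local n_reps 20                  // small number, for illustration only

    tempname results
    tempfile resfile
    postfile `results' int rep double mean_lnwage using `resfile'

    preserve
    forvalues i = 1/`n_reps' {
        gen double shuffle = runiform()
        by idcode (year), sort: keep if shuffle[1] < 0.05
        summarize ln_wage, meanonly  // example statistic on this sample
        post `results' (`i') (r(mean))
        restore, preserve
    }
    postclose `results'

    restore, not                     // discard the preserved copy
    use `resfile', clear             // one row per replicate
    ```

    The posted results live in a separate file, so they survive each -restore- while the working data are reset to the full panel.
    
    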

    Added: In the future, when showing data examples, please use the -dataex- command to do so, as I have here. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.



    • #3
      In addition to Clyde's solution, you can still use -sample- to randomly draw panel IDs. The idea is to first tag the first observation of each ID, then randomly draw 5% of the tagged observations (untagged observations are not dropped at this step), and finally keep all observations of the sampled IDs.

      Code:
      webuse nlswork, clear
      sort idcode year
      
      egen tag = tag(idcode)
      
      sample 5 if tag
      
      bys idcode (year): keep if tag[1]
      Last edited by Fei Wang; 13 Jun 2022, 19:33.
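
      If a fixed number of IDs is wanted rather than a percentage, the same idea should carry over with the count option of -sample-. The sketch below is a variant, not the code above: the figure of 200 IDs and the seed are arbitrary choices for illustration, and the final -keep- sorts on tag within idcode so that it does not depend on which observation -egen, tag()- happened to mark.

      ```stata
      webuse nlswork, clear
      set seed 1234                   // makes the draw reproducible

      egen tag = tag(idcode)          // one tagged observation per idcode

      * Keep 200 of the tagged observations; untagged ones are all retained
      sample 200 if tag, count

      * An idcode survives only if its tagged observation is still present
      bysort idcode (tag): keep if tag[_N]
      sort idcode year
      ```

      After this runs, the data contain every period for exactly 200 idcodes.
      
      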



      • #4
        Thank you Clyde and Fei!
