
  • Random sampling according to group in Stata

    Hi everyone,

    I have a question about how to randomly sample data in Stata by group. My data structure is shown in the example below.

    I want to randomly sample 1/3 of the ids and keep every observation that belongs to a sampled id. For instance, if id 3 is chosen, I keep all observations of id 3 (4 obs in this example).
    Similarly, if id 1 is chosen, I keep all observations of id 1.

    Does anyone know how to do this in Stata?


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte id str9 date
    1 "2022/1/2" 
    1 "2022/1/8" 
    1 "2022/1/9" 
    1 "2022/1/12"
    1 "2022/1/13"
    1 "2022/1/14"
    2 "2022/1/3" 
    2 "2022/1/6" 
    2 "2022/2/1" 
    2 "2022/2/8" 
    2 "2022/2/11"
    3 "2022/2/6" 
    3 "2022/2/9" 
    3 "2022/2/13"
    3 "2022/2/15"
    end
    Thanks a lot!

  • #2
    If you're fine with each ID having an equal chance of getting selected, you could reshape to wide and then sample.

    Code:
    clear
    input byte id str9 date
    1 "2022/1/2" 
    1 "2022/1/8" 
    1 "2022/1/9" 
    1 "2022/1/12"
    1 "2022/1/13"
    1 "2022/1/14"
    2 "2022/1/3" 
    2 "2022/1/6" 
    2 "2022/2/1" 
    2 "2022/2/8" 
    2 "2022/2/11"
    3 "2022/2/6" 
    3 "2022/2/9" 
    3 "2022/2/13"
    3 "2022/2/15"
    end
    
    
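    bysort id: gen index = _n            // number the observations within each id
    reshape wide date, i(id) j(index)    // one row per id
    sample 2, count                      // keep 2 randomly chosen ids; adjust the count as needed
    reshape long                         // back to one row per id and date
    drop index
    drop if date==""                     // remove the empty rows created by padding to wide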
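
    Since the question asks for one third of the ids rather than a fixed number, the count passed to -sample- could be computed from the number of rows once the data are wide. What follows is only a minimal sketch: the toy data, the seed value, and the local macro name k are made up for illustration, and it assumes, as above, that date is the only variable besides id.

```stata
clear
input byte id str9 date
1 "2022/1/2"
1 "2022/1/8"
2 "2022/1/3"
3 "2022/2/6"
3 "2022/2/9"
4 "2022/2/13"
5 "2022/2/15"
6 "2022/3/1"
end

set seed 2022                        // arbitrary value; makes the draw reproducible
bysort id: gen index = _n            // number the observations within each id
reshape wide date, i(id) j(index)    // one row per id
local k = ceil(_N/3)                 // one third of the ids, rounded up
sample `k', count                    // keep k randomly chosen ids
reshape long                         // back to one row per id and date
drop if date == ""                   // remove the empty rows created by padding to wide
drop index
```

    With the six ids in this toy data, k is 2, so two ids survive with all of their observations.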



    • #3
      Originally posted by Lizzy Padhi:
      If you're fine with each ID having an equal chance of getting selected, you could reshape to wide and then sample.

      Hi Lizzy,

      Thank you so much for your help. The code works perfectly for me.
      Cheers!



      • #4
        Originally posted by Lizzy Padhi:
        If you're fine with each ID having an equal chance of getting selected, you could reshape to wide and then sample.

        Hi Lizzy,

        Sorry, I have an additional question. The data I showed you is just a sample of my much larger dataset. In my real data, there are around 100 other variables, such as quantity and price, that are associated with each ID and can vary over time.

        When I applied the code to my real data, I encountered the error shown below. Do you know how to fix it? The goal is still to keep all the obs of the randomly selected IDs.

        Thanks a lot!

        [Attached image: 3.PNG]



        • #5
          I would choose to do this without reshaping. The general strategy is to temporarily reduce the file to one observation per id, select a sample and create a marker variable, save that sample to a file, and merge back onto the original file.

          Code:
          set seed 47642  // whatever you like
          preserve
          bysort id: keep if _n ==1  // one observation per id
          sample 33 // percent
          gen byte insample = 1  // a variable to mark the chosen IDs
          keep id insample
          tempfile temp
          save `temp'
          restore
          //
          merge m:1 id using `temp'
          list, nolabel  // see what this did
          keep if insample ==1
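
          The same idea can also be sketched without a temporary file, by attaching one random draw to each id and keeping the ids with the smallest draws. This is only an illustration under assumptions: the toy data, the price variable, and the names u and idrank are made up, and a tie between two ids' draws (very unlikely with a double) would make them share a rank.

```stata
clear
input byte id str9 date float price
1 "2022/1/2"  10
1 "2022/1/8"  11
2 "2022/1/3"  20
2 "2022/1/6"  21
3 "2022/2/6"  30
4 "2022/2/9"  40
4 "2022/2/13" 41
5 "2022/2/15" 50
6 "2022/3/1"  60
end

set seed 47642                                    // whatever you like
bysort id: gen double u = runiform() if _n == 1   // one random draw per id
bysort id (u): replace u = u[1]                   // missing sorts last, so u[1] is the draw
egen long idrank = group(u)                       // rank the ids by their draw: 1, 2, ...
quietly summarize idrank
keep if idrank <= ceil(r(max)/3)                  // keep one third of the ids, rounded up
drop u idrank
```

          Because no observations are collapsed or reshaped, all other variables (here price) are carried along unchanged for the selected ids.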



          • #6
            Originally posted by Mike Lacy:
            I would choose to do this without reshaping. The general strategy is to temporarily reduce the file to one observation per id, select a sample and create a marker variable, save that sample to a file, and merge back onto the original file.

            Hi Mike,

            The code works perfectly for me. Thank you so much for your help!

            Best
            Meng
