  • Randomly deleting observations by a group id

    Dear Statalist users,
    I have a dataset of relationships between households, shown below. The variables are the household id (hhid_2013), the id of the household to which that household is connected (net_hhid_2013), and the relationship through which the two are connected (relation_code). A household can be connected to at most 5 households for each relationship, but in my dataset some households are connected to more than 5 households for a given relationship code. I want to drop observations at random so that each household has at most 5 connections per relationship. Please note that I do not want to drop observations systematically; the deletion must be random. Can someone please help me do this?
    Thanks

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float relation_code int(hhid_2013 net_hhid_2013)
    1.1 273  62
    1.1 273  99
    1.1 273 241
    1.1 273  26
    1.1 278 225
    1.1 278 170
    1.1 278   .
    1.1 278 146
    1.1 278 176
    1.2 287 174
    1.2 287   .
    1.2 287 288
    1.2 287 177
    1.2 287  75
    1.1 303  78
    1.1 303 200
    1.1 303  50
    1.1 303 167
    1.1 303   .
    1.1 311   .
    1.1 311  46
    1.1 311 208
    1.1 311 234
    1.1 311 310
    1.1  18  96
    1.1  18 237
    1.1  18  51
    1.1  18 130
    1.1  18  53
    1.1  18   .
    1.2  18   .
    1.2  18  96
    1.2  18  51
    1.2  18 130
    1.2  18  53
    1.2  18 237
    1.1  43   .
    1.1  43  20
    1.1  43  53
    1.1  43  59
    1.1  43   9
    1.1  43  34
    1.1  73 131
    1.1  73  75
    1.1  73  72
    1.1  73   .
    1.1  73 102
    1.1  73  86
    1.2  73 102
    1.2  73  86
    1.2  73   .
    1.2  73  72
    1.2  73  75
    1.2  73 131
    1.2  86 123
    1.2  86  49
    1.2  86  94
    1.2  86 105
    1.2  86 249
    1.2  86 312
    1.1 120 271
    1.1 120 121
    1.1 120 118
    1.1 120 110
    1.1 120   .
    1.1 120 119
    1.2 120 118
    1.2 120 121
    1.2 120 119
    1.2 120 110
    1.2 120 271
    1.2 120   .
    1.1 143  92
    1.1 143  62
    1.1 143 192
    1.1 143  34
    1.1 143   .
    1.1 143 226
    1.1 145 293
    1.1 145  51
    1.1 145  22
    1.1 145 270
    1.1 145 173
    1.1 145   .
    1.1 158  19
    1.1 158  18
    1.1 158   .
    1.1 158 254
    1.1 158 199
    1.1 158 237
    1.2 159 203
    1.2 159  18
    1.2 159 293
    1.2 159 235
    1.2 159 305
    1.2 159  97
    1.1 161  19
    1.1 161 309
    1.1 161 217
    1.1 161 243
    end

  • #2
    You have some missing connections (net_hhid_2013), which I will drop first. Sorting on a random variable and selecting the first 5 sorted observations in a group will provide the random selection you seek.

    Code:
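    * random sort key, one draw per observation
    gen selector= rnormal()
    * rows with no recorded connection are not candidates for selection
    drop if missing(net_hhid_2013)
    * within each household-relationship group, sort by the random key and keep the first 5
    bys hhid_2013 relation_code (selector): keep if _n<=5
    drop selector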
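
    As a minimal sketch that is not part of the original answer, the result of the commands above can be checked with assert, which stops with an error if a condition fails for any observation. It assumes the variable names from the example data and that the selection has already been run.

    Code:
    * every household-relationship group should now hold at most 5 rows
    bys hhid_2013 relation_code: assert _N <= 5
    * no missing connections should remain
    assert !missing(net_hhid_2013)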


    • #3
      Thanks Andrew. I think the results will not be reproducible, because each run will produce a different sample.


      • #4
        Set the seed if you want to reproduce the same results. The point of random selection is that the selection is random.

        Code:
        set seed 02082024
        gen selector= rnormal()
        drop if missing(net_hhid_2013)
        bys hhid_2013 relation_code (selector): keep if _n<=5
        drop selector
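
        One caveat, added here as a hedged sketch rather than as part of the answer above: set seed fixes the sequence of random numbers, but the draws are assigned to observations in their current order, so the data should also be in the same order on every run. The variant below assumes that hhid_2013, relation_code and net_hhid_2013 jointly identify an observation once the missing connections are dropped; isid stops with an error if they do not. Using runiform() stored as a double instead of rnormal() is a choice made for this sketch, not something the thread requires.

        Code:
        drop if missing(net_hhid_2013)
        * assumption: these three variables uniquely identify a row
        isid hhid_2013 relation_code net_hhid_2013
        * put the data in a deterministic order before drawing random numbers
        sort hhid_2013 relation_code net_hhid_2013
        set seed 02082024
        gen double selector = runiform()
        bys hhid_2013 relation_code (selector): keep if _n<=5
        drop selector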
