  • Anonymising an evolving Dataset

    Dear all,
    I have a large unbalanced panel dataset (about 3 million observations per month for about 6 years) and would like to anonymise two variables: a firm-specific ID and a worker-specific ID. To anonymise the current data, I could do something like the code below, which I found in some threads:

    keep worker_id
    sort worker_id
    by worker_id: keep if _n==1

    set seed 123456

    gen double random = uniform()
    sort random
    bysort random : assert _N == 1
    gen long new_worker_id=_n
    sort worker_id
    save mapping_pid, replace

    I would then merge the two datasets and keep only the new worker ID, and do the same for the firm IDs.
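
    As a minimal sketch of that merge step, assuming the full panel sits in a hypothetical file full_panel.dta and that every worker in it appears in mapping_pid.dta (both file names other than mapping_pid are placeholders, not from the threads I found):

    ```stata
    * Sketch only: full_panel.dta and full_panel_anon.dta are hypothetical names
    use full_panel.dta, clear
    * attach the anonymised ID; assert(match) stops if any worker lacks a mapping
    merge m:1 worker_id using mapping_pid.dta, keepusing(new_worker_id) assert(match) nogenerate
    * drop the original identifier so only the anonymised ID remains
    drop worker_id
    save full_panel_anon.dta, replace
    ```

    The same pattern would apply to the firm ID with its own mapping file.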

    However, the dataset is dynamic. New data will arrive monthly, and firms and workers may enter or exit over time. For each new month of data, I would prefer to anonymise only the incoming data (the full panel will become too large and cumbersome to handle over time) while ensuring that the new anonymised IDs are consistent with the old ones. Would something like the above code work?

    Many thanks,

    Pavlos

  • #2
    Assuming that the original worker-specific ID from the source file(s) uniquely identifies workers and is constant over time, you can get the number of observations, i.e., workers, from your mapping file, then merge with the incoming data, keeping only the workers who are not yet in the mapping file:
    Code:
    use mapping_pid.dta
    local N = c(N)
    merge 1:m worker_id using incoming_data.dta , keep(using)
    Next, sort the newly incoming worker IDs randomly, using the same code you suggested, but start the randomized IDs after the number of observations, i.e., workers, you already have
    Code:
    sort worker_id
    by worker_id : keep if _n == 1
    set seed 42
    generate double random = uniform()
    sort random
    by random : assert _N == 1
    generate long new_worker_id = _n + `N'
    keep worker_id new_worker_id
    Last, update your mapping file
    Code:
    merge 1:1 worker_id new_worker_id using mapping_pid , assert(master using) nogenerate // no matches; nogenerate keeps _merge out of the saved file so next month's merge does not fail
    save mapping_pid.dta , replace
    Obviously, larger values of new_worker_id imply later panel entries. If that's a problem, you need a different approach.
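
    Once the mapping file has been updated, the incoming month itself still needs the anonymised ID attached. A minimal sketch of that step, assuming the output goes to a hypothetical file incoming_data_anon.dta (that name is a placeholder, not something from the original post):

    ```stata
    * Sketch only: incoming_data_anon.dta is a hypothetical file name
    use incoming_data.dta, clear
    * every incoming worker must now be in the mapping (no master-only rows allowed);
    * workers in the mapping who are absent this month are discarded by keep(match)
    merge m:1 worker_id using mapping_pid.dta, keepusing(new_worker_id) assert(match using) keep(match) nogenerate
    * drop the original identifier so only the anonymised ID remains
    drop worker_id
    save incoming_data_anon.dta, replace
    ```

    Because only the mapping file and the current month are loaded, the full panel never has to be held in memory.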
    Last edited by daniel klein; 25 Sep 2024, 11:58.



    • #3
      Dear Daniel,
      Many thanks for this.
      Best,
      P.
