Anonymising an evolving Dataset

Pavlos Petroulas

Join Date: Sep 2024

Posts: 2
#1

25 Sep 2024, 04:55

Dear all,
I have a large unbalanced panel dataset (about 3 million observations per month for about 6 years) and would like to anonymise two variables: a firm-specific ID and a worker-specific ID. To anonymise the current data, I could do something like the code below, which I found in some threads:

* keep one observation per worker
keep worker_id
sort worker_id
by worker_id: keep if _n==1

set seed 123456

* assign the new IDs in random order
gen double random = uniform()
sort random
bysort random : assert _N == 1
gen long new_worker_id=_n
sort worker_id
save mapping_pid, replace

and then merge this mapping with the full dataset, keeping only the new worker ID. The firm IDs would be handled similarly.
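A minimal sketch of that merge step, assuming the full panel is stored in a file called panel.dta and the anonymised copy is written to panel_anon.dta (both file names are hypothetical), and that mapping_pid.dta was created as above:

```stata
* panel.dta and panel_anon.dta are hypothetical file names
use panel.dta, clear

* attach the new ID to every observation of each worker;
* assert(match) stops with an error if any worker lacks a mapping
merge m:1 worker_id using mapping_pid, keepusing(new_worker_id) assert(match) nogenerate

* drop the original identifier so only the anonymised one remains
drop worker_id
save panel_anon.dta, replace
```

The keepusing() option leaves the helper variable random out of the merged data.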

However, the dataset is dynamic. New data will arrive monthly, and firms and workers may enter and/or exit over time. Preferably, for each new month of data that I get, I would like to anonymise only the incoming data (as the full panel will become overly large and cumbersome to handle over time) and ensure that the new anonymised IDs are consistent with the old ones. Would something like the above code work?

Many thanks,

Pavlos
daniel klein

Join Date: Mar 2014

Posts: 3805
#2

25 Sep 2024, 11:55

Assuming that the original worker-specific ID from the source file(s) uniquely identifies workers and is constant over time, you can get the number of observations, i.e., workers, from your mapping file, then merge with the incoming data, keeping only the workers that are not yet in the mapping file:

Code:

use mapping_pid.dta
local N = c(N)
merge 1:m worker_id using incoming_data.dta , keep(using)

Next, sort the newly incoming worker IDs randomly, using the same code you suggested, but start numbering the new IDs after the number of workers you already have:

Code:

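keep worker_id // drop everything else, including the empty new_worker_id from the mapping file
sort worker_id
by worker_id : keep if _n == 1
set seed 42
generate double random = uniform()
sort random
by random : assert _N == 1
generate long new_worker_id = _n + `N'
keep worker_id new_worker_id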

Last, update your mapping file by adding the new workers to it:

Code:

merge 1:1 worker_id new_worker_id using mapping_pid , assert(master using) nogenerate // no matches; nogenerate keeps _merge out of the saved file
save mapping_pid.dta , replace

Obviously, larger values of new_worker_id imply later panel entries. If that's a problem, you need a different approach.

Last edited by daniel klein; 25 Sep 2024, 11:58.
Pavlos Petroulas

Join Date: Sep 2024

Posts: 2
#3

26 Sep 2024, 01:35

Dear Daniel,
Many thanks for this.
Best,
P.