Generating random ID for patients in the dataset

Sam Lin

Join Date: Mar 2015

Posts: 14
#1

06 Jun 2024, 17:49

Dear Stata Experts,

I'm trying to generate a random ID for patients in my datasets so that we can de-identify them. I know that I could use the function below to generate a five-digit random number. However, I need to run this line of code on several datasets that may contain the same patient. Is there a way to retain the same random ID for the same patient across datasets? Or is there a way to set a seed for each patient using the patient's unique identifier (e.g., MRN, person number, etc.)? Any advice would be appreciated.

Code:

gen random_int=runiformint(10000, 50000)

Thanks!
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

06 Jun 2024, 18:41

If you want coordination across data sets, the code has to apply to all of the data sets at once. So something like this:

Code:

clear
tempfile building
save `building', emptyok
forvalues i = 1/5 {
    use id using dataset_`i', clear
    append using `building'
    save `"`building'"', replace
}
duplicates drop
set seed 12345 // OR WHATEVER SEED YOU LIKE
gen masked_id = runiformint(10000, 50000)
isid masked_id
sort id
save id_crosswalk, replace

forvalues i = 1/5 {
    use dataset_`i', clear
    merge m:1 id using id_crosswalk, assert(match using) keep(match) nogenerate
    drop id
    order masked_id, first
    save masked_dataset_`i', replace
}

Note: I'm not really fond of using -runiformint(10000, 50000)- to create the masked id's. That range offers only 40,001 possible values, so the chance that two different id's are assigned the same masked_id rises quickly with the number of distinct id's: it is already substantial with a few hundred id's and close to certain with a few thousand. It is for this reason that I included -isid masked_id- in the code. If that line halts with an error message, you have encountered this problem.

But rather than detect the problem, I would consider using larger bounds on -runiformint()- to avoid that problem. Or, better still, sort the id's into random order and then just make the masked_id be consecutive integers. That part of the code would be:

Code:

duplicates drop
set seed 12345 // OR WHATEVER SEED YOU LIKE
gen double shuffle = runiform()
sort shuffle
gen `c(obs_t)' masked_id = _n
drop shuffle
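To get a feel for how quickly the collision risk with -runiformint(10000, 50000)- grows, the standard birthday-problem approximation can be evaluated directly. This is only a rough sketch: the id counts in the loop are arbitrary illustrations, not figures from the thread.

```stata
* Approximate chance that at least two of n distinct ids draw the same value
* from runiformint(10000, 50000), which has N = 40,001 possible values.
* Uses the birthday-problem approximation 1 - exp(-n(n-1)/(2N)).
local N = 50000 - 10000 + 1
foreach n of numlist 100 250 1000 5000 {
    display "n = " %6.0f `n' "   P(collision) ~ " %6.4f (1 - exp(-`n'*(`n' - 1)/(2*`N')))
}
```

Assigning consecutive integers after a random sort, as above, sidesteps the issue entirely because no two id's can receive the same number.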
Sam Lin

Join Date: Mar 2015

Posts: 14
#3

07 Jun 2024, 16:16

Clyde Schechter

Clyde, thank you so much for providing the code for generating the masked ID. It makes a lot of sense to me to apply the code to all of the datasets at once to ensure that the masked ID is consistent across datasets. However, we have to extract the datasets sequentially, so we may not be able to apply the code all at once. I also agree that runiformint(10000, 50000) may assign the same masked ID to different individuals. We were trying to use a shorter ID (around 5 digits) so that it's easier to cross-check a patient's data with other data sources. The -isid- command will definitely come in handy!

Given that we are limited to sequential dataset extraction, I found a wrapper that uses a hash function to generate an alphanumeric masked ID. This helps us maintain the same ID across datasets; however, the masked ID is a very long string of alphanumeric characters (e.g., b3b5d8e3e56a8e82ca3f032102293beb40361f96). I know we can try to truncate the ID to make it shorter (maybe 6-9 characters) using the substr() function and check uniqueness using -isid-. Do you know of a better way to do this in Stata?
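The truncate-and-check idea described above might look like the following minimal sketch. The variable names hash_id (the full hash string, assumed to exist already) and short_id are hypothetical, and the length of 8 characters is just one choice within the 6-9 range mentioned.

```stata
* Hypothetical names: hash_id holds the full hash string already generated
* for each patient; short_id is the truncated version.
gen str8 short_id = substr(hash_id, 1, 8)

* Each short_id must correspond to exactly one full hash. This halts with an
* error if truncation has merged two different patients.
bysort short_id (hash_id): assert hash_id[1] == hash_id[_N]
```

Unlike -isid short_id-, this check also works when a patient has more than one row. Passing it in one dataset does not rule out a clash with a patient who appears only in a later extract, so the check would need to be repeated on the pooled set of ids.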

Thank you so much for your assistance!
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

07 Jun 2024, 17:08

Take a look at https://www.statalist.org/forums/for...shing-a-string. It's a long thread (make sure you look at the second page) and it offers several approaches, most of which are potentially suitable for your purposes and avoid using unreasonably long hashes that bloat data sets and greatly slow down execution of subsequent data management and analysis. By the way, notwithstanding the thread's title, only some of the methods outlined there are based on hashing. Some of them are similar to what I have proposed, but can be used when the data set is built sequentially.
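As a rough illustration of how a crosswalk could be maintained when extracts arrive one at a time, here is a minimal sketch. It is not taken from the linked thread. It assumes a crosswalk file id_crosswalk.dta (variables id and masked_id, with at least one observation) already exists from earlier extracts, and the file name new_extract is hypothetical.

```stata
* Hypothetical file names: id_crosswalk.dta (variables id and masked_id,
* built from earlier extracts) and new_extract.dta (the latest extract).

* Highest masked_id handed out so far
use id_crosswalk, clear
summarize masked_id, meanonly
local last = r(max)

* Ids in the new extract that are not yet in the crosswalk
use id using new_extract, clear
duplicates drop
merge 1:1 id using id_crosswalk, keep(master) nogenerate
drop masked_id

* Number the new ids in random order, continuing after the last one used
set seed 12345 // OR WHATEVER SEED YOU LIKE
gen double shuffle = runiform()
sort shuffle
gen long masked_id = `last' + _n
drop shuffle

* Add them to the crosswalk and confirm both ids remain unique
append using id_crosswalk
isid id
isid masked_id
sort id
save id_crosswalk, replace
```

Each extract can then be masked with the same -merge m:1 id using id_crosswalk- step shown in #2. One trade-off of this design is that higher masked_id values reveal that a patient first appeared in a later extract.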
Sam Lin

Join Date: Mar 2015

Posts: 14
#5

10 Jun 2024, 13:07

Clyde Schechter

Hi Clyde,
Thank you so much for referring to the previously discussed thread. The discussions there are extremely helpful.