  • Generating random ID for patients in the dataset

    Dear Stata Experts,

    I'm trying to generate a random ID for patients in my datasets so that we can de-identify them. I know I can use the function below to generate a five-digit random number. However, I need to run this code across several datasets that may contain the same patient. Is there a way to retain the same random ID for the same patient across datasets? Or is there a way to set a seed for each patient using the patient's unique identifier (e.g., MRN, person number, etc.)? Any advice would be appreciated.

    Code:
    gen random_int=runiformint(10000, 50000)

    Thanks!

  • #2
    If you want coordination across data sets, the code has to apply to all of the data sets at once. So something like this:

    Code:
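    * Step 1: stack the id variable from every data set into one file
    clear
    tempfile building
    save `building', emptyok
    forvalues i = 1/5 {
        use id using dataset_`i', clear
        append using `building'
        save `"`building'"', replace
    }
    
    * Step 2: keep one row per id and assign each a random masked_id
    duplicates drop
    set seed 12345 // OR WHATEVER SEED YOU LIKE
    gen masked_id = runiformint(10000, 50000)
    isid masked_id // halts with an error if two id's received the same masked_id
    
    sort id
    save id_crosswalk, replace
    
    * Step 3: attach masked_id to each data set and drop the original id
    forvalues i = 1/5 {
        use dataset_`i', clear
        merge m:1 id using id_crosswalk, assert(match using) keep(match) nogenerate
        drop id
        order masked_id, first
        save masked_dataset_`i', replace
    }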
    Note: I'm not really fond of using -runiformint(10000, 50000)- to create the masked id's. That range offers only 40,001 possible values, so by the birthday problem a clash becomes more likely than not once your data sets collectively contain a few hundred distinct id's, and it is all but certain with a few thousand: you risk having the same masked_id assigned to two different id's. It is for this reason that I included -isid masked_id- in the code. If that line halts with an error message, you have encountered this problem.

    But rather than detect the problem, I would consider using larger bounds on -runiformint()- to avoid that problem. Or, better still, sort the id's into random order and then just make the masked_id be consecutive integers. That part of the code would be:
    Code:
    duplicates drop
    set seed 12345 // OR WHATEVER SEED YOU LIKE
    gen double shuffle = runiform()
    sort shuffle
    gen `c(obs_t)' masked_id = _n
    drop shuffle
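
    To put rough numbers on the collision risk from drawing masked id's at random, here is a minimal sketch using the standard birthday approximation, 1 - exp(-n(n-1)/(2k)), for n distinct id's drawn from k equally likely values. The values of n in the loop are illustrative assumptions, not figures from this thread.
    Code:
    * k = number of possible values returned by runiformint(10000, 50000)
    local k = 50000 - 10000 + 1
    * n = assumed number of distinct id's (illustrative values only)
    foreach n of numlist 100 300 1000 5000 {
        display "n = " %6.0f `n' "   approx. P(collision) = " %6.4f (1 - exp(-`n'*(`n'-1)/(2*`k')))
    }
    Raising the upper bound enlarges k and lowers these probabilities, while consecutive integers assigned in shuffled order remove the risk altogether.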


    • #3
      Clyde Schechter

      Clyde, thank you so much for providing the code for generating the masked ID. It makes a lot of sense to apply the code to all of the datasets at once to ensure that masked IDs are consistent across datasets. However, we have to extract the datasets sequentially, so we may not be able to apply the code all at once. I also agree that runiformint(10000, 50000) may create the same masked ID for different individuals. We were trying to use a shorter ID (around 5 digits) so that it's easier to cross-check a patient's data against other data sources. The -isid- command will definitely come in handy!

      Given that we are limited to sequential dataset extraction, I found a wrapper that uses a hash function to generate an alphanumeric masked ID. This lets us keep the same ID across datasets; however, the masked ID is a very long alphanumeric string (e.g., b3b5d8e3e56a8e82ca3f032102293beb40361f96). I know we could truncate the ID to make it shorter (maybe 6-9 characters) using the -substr()- function and check uniqueness using -isid-. Do you know of a better way to do this in Stata?

      Thank you so much for your assistance!
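
      For reference, the truncate-and-check idea described above might look like the sketch below. It assumes the wrapper has already stored the full hash in a string variable called hash_id and that the original identifier is called id; both names, and the choice of 8 characters, are hypothetical.
      Code:
      * keep the first 8 characters of the (assumed) 40-character hash
      gen str8 masked_id = substr(hash_id, 1, 8)

      * -isid- checks uniqueness across observations, so test at the patient
      * level: one row per id, then confirm no two id's share a masked_id
      preserve
      keep id masked_id
      duplicates drop
      isid masked_id
      restore
      Passing this check in one dataset does not guarantee that a later extraction will not introduce a patient whose truncated hash matches an existing one, so the check would need to be repeated against all id's seen so far.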


      • #4
        Take a look at https://www.statalist.org/forums/for...shing-a-string. It's a long thread (make sure you look at the second page) and it offers several approaches, most of which are potentially suitable for your purposes and avoid using unreasonably long hashes that bloat data sets and greatly slow down execution of subsequent data management and analysis. By the way, notwithstanding the thread's title, only some of the methods outlined there are based on hashing. Some of them are similar to what I have proposed, but can be used when the data set is built sequentially.
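
        As an illustration of the sequential idea only (this is not taken from the linked thread), a crosswalk can be kept on disk and extended each time a new extraction arrives. The sketch assumes id_crosswalk.dta already exists, is not empty, and holds id and masked_id as consecutive integers, and that the new extraction is saved as new_dataset.dta; the file name new_dataset and the seed are hypothetical.
        Code:
        * highest masked_id handed out so far
        use id_crosswalk, clear
        quietly summarize masked_id
        local last = r(max)

        * id's in the new extraction that are not yet in the crosswalk
        use id using new_dataset, clear
        duplicates drop
        merge 1:1 id using id_crosswalk, keep(master) nogenerate
        drop masked_id // all missing for these unmatched id's

        * number the new id's in random order, continuing after `last'
        set seed 2468 // hypothetical seed
        gen double shuffle = runiform()
        sort shuffle
        gen long masked_id = `last' + _n
        drop shuffle

        * extend the crosswalk and confirm it is still one-to-one
        append using id_crosswalk
        isid id
        isid masked_id
        sort id
        save id_crosswalk, replace
        The new dataset can then be masked with the same -merge m:1 id using id_crosswalk- step shown in #2. One trade-off of this design is that masked id's reveal which extraction batch a patient first appeared in, and the crosswalk file itself must be stored as securely as the original identifiers.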


        • #5
          Clyde Schechter

          Hi Clyde,
          Thank you so much for referring to the previously discussed thread. The discussions there are extremely helpful.
