Creating unique IDs

Laura C

Join Date: Jun 2014

Posts: 41
#1

07 Dec 2015, 12:45

Hi everyone.

I have some datasets that contain first names and birthdays. In order to anonymize the data I would like to create numeric IDs and remove the first names. The trick is that these IDs have to be determined in such a way that I can match them across datasets (multiple surveys). I know how to create IDs using egen's group() function or encode, but the resulting IDs would not be the same across datasets.
Assuming that the combination of first name, birth date, and one other variable (a string) would create a unique ID, is there any quick way of doing this in Stata? Or in Excel, if that is easier?

Many thanks.

Laura

Regards,
Laura Cojocaru
Nick Cox

Join Date: Mar 2014

Posts: 35445
#2

07 Dec 2015, 12:55

I don't think replacing first names and birthdays with an ID will anonymise the data, and adding a third variable won't change that. I think you would need some encryption too.
Laura C

Join Date: Jun 2014

Posts: 41
#3

07 Dec 2015, 13:03

I think I did not properly explain what I wanted to do.
I want to create a numeric ID that would replace first name and birth date. I discuss the third variable because the combination of first name and birth date would not be unique (many Michaels might have been born on the same date).

Regards,
Laura Cojocaru
Nick Cox

Join Date: Mar 2014

Posts: 35445
#4

07 Dec 2015, 13:09

Indeed, and that's consistent with my point too. If your mapping to numeric is reversible, it's not anonymising.
Laura C

Join Date: Jun 2014

Posts: 41
#5

07 Dec 2015, 13:14

I see. This data will not be public, but rather just shared with a few researchers who will be under a non-disclosure contract. So let's call it masking? I could also add some numbers to the resulting IDs (e.g. 298 if the name starts with A-D, etc.). I realize it is not exactly encryption, but the data is not that sensitive.

Regards,
Laura Cojocaru
Nick Cox

Join Date: Mar 2014

Posts: 35445
#6

07 Dec 2015, 13:20

That convinces me. Your workplace may have more stringent rules.
Sarah Edgington

Join Date: Apr 2014

Posts: 284
#7

07 Dec 2015, 14:05

One strategy is to create an arbitrary ID number. In this case the important part is that you retain a crosswalk between the three relevant variables and the IDs (and protect it in whatever way satisfies the conditions set forth by the promises you made to your respondents and your IRB when you made the data).

You can create an ID that's a random integer. You can create an ID based on the ordering of the data in your first dataset. It doesn't much matter if you keep a separate dataset that contains the ID number and the three variables that create a unique ID. Then you can merge that id number dataset to all subsequent datasets, using your 3 key variables. Then drop the key variables and you have a dataset where individuals are identified only by their ID numbers.
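The crosswalk idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: the file names wave1 and wave2 and the variable names birthdate and thirdvar are hypothetical, and the first dataset is assumed to hold one row per respondent.

```stata
* Sketch only: wave1, wave2, birthdate and thirdvar are hypothetical names
use wave1, clear
isid firstname birthdate thirdvar     // errors out unless the three keys identify each row
keep firstname birthdate thirdvar
generate long id = _n                 // arbitrary ID taken from the current row order
save crosswalk, replace               // keep this file private and protected

* Attach the IDs to a later dataset, then drop the identifying variables
use wave2, clear
merge m:1 firstname birthdate thirdvar using crosswalk, ///
    assert(match using) keep(match) nogenerate
drop firstname birthdate thirdvar
```

The assert(match using) option makes the merge stop with an error if wave2 contains anyone missing from the crosswalk, so unmatched respondents are not silently left without an ID.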

You could also, of course, devise a more complicated set of rules for deriving the ID numbers from the identifying variables. That takes more effort, though, and is arguably less secure. Either way you have to keep a crosswalk in your possession (either the dataset linking the IDs to names, or some code/documentation of the rules so that you can recreate them later), but if the IDs are arbitrary, a user cannot reverse engineer them to get back to an identifiable dataset.
Jeph Herrin

Join Date: Apr 2014

Posts: 332
#8

07 Dec 2015, 14:26

I've usually used Sarah's approach - create the IDs randomly in a "master" list and then export a crosswalk file that can be used to code the others. If I have all the files in advance, I open each, keep the ID variables, append to a master list, code the master list using some random process, then use this to code each file.

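* Keep one row per person; birthdate and thirdvar are placeholder names for the other ID variables
bysort firstname birthdate thirdvar: keep if _n == 1
set seed 20151207
* Put the rows in random order, then number them
generate double u = runiform()
sort u
generate long id_random = _n
drop u
save xwalk, replace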

Then merge in the file -xwalk- to your other files to get the same random assignment.
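A minimal sketch of that merge step, looped over several files, might look like this. The file names survey1 to survey3 and the key names birthdate and thirdvar are hypothetical; id_random is the variable created above.

```stata
* Sketch only: survey1-survey3, birthdate and thirdvar are hypothetical names
foreach f in survey1 survey2 survey3 {
    use `f', clear
    * Every row must find its person in xwalk, otherwise the merge stops with an error
    merge m:1 firstname birthdate thirdvar using xwalk, ///
        assert(match using) keep(match) keepusing(id_random) nogenerate
    drop firstname birthdate thirdvar
    save `f'_masked, replace
}
```

Because the keys are strings, they have to be spelled identically in every file (same case, no stray spaces) for the merge to match, so it may be worth standardising them before building the crosswalk.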

hth
Jeph
Laura C

Join Date: Jun 2014

Posts: 41
#9

07 Dec 2015, 14:36

Great ideas! Many thanks!

Regards,
Laura Cojocaru