  • Creating unique IDs

    Hi everyone.

    I have some datasets that contain first names and birthdays. In order to anonymize the data I would like to create numeric IDs and remove the first names. The trick is that these IDs have to be determined in such a way that they would allow me to match them across datasets (multiple surveys). I know how to create IDs using egen's group() function or encode, but the resulting IDs would not be the same across datasets.
    Assuming that the combination of first name, birth date, and one other variable (a string) would create a unique ID, is there any quick way of doing this in Stata? Or Excel, if easier.

    Many thanks.

    Laura
    Regards,
    Laura Cojocaru

  • #2
    I don't think first names and birthdays will anonymise, and adding a third variable won't affect that. I think you would need some encryption too.


    • #3
      I think I did not properly explain what I wanted to do.
      I want to create a numeric ID that would replace first name and birth date. I discuss the third variable because the combination of first name and birth date would not be unique (many Michaels might have been born on the same date).
      Regards,
      Laura Cojocaru


      • #4
        Indeed, and that's consistent with my point too. If your mapping to numeric is reversible, it's not anonymising.


        • #5
          I see. This data will not be public, but rather just shared with a few researchers who will be under a non-disclosure contract. So let's call it masking? I could also add some numbers to the resulting IDs (e.g. 298 if the name starts with A-D, etc.). I realize it is not exactly encryption, but the data is not that sensitive.
          Regards,
          Laura Cojocaru


          • #6
            That convinces me. Your workplace may have more stringent rules.


            • #7
              One strategy is to create an arbitrary ID number. In this case the important part is that you retain a crosswalk between the three relevant variables and the IDs (and protect it in whatever way satisfies the conditions set forth by the promises you made to your respondents and your IRB when you made the data).

              You can create an ID that's a random integer. You can create an ID based on the ordering of the data in your first dataset. It doesn't much matter which, as long as you keep a separate dataset that contains the ID number and the three variables that together identify a person uniquely. Then you can merge that ID number dataset to all subsequent datasets, using your 3 key variables. Then drop the key variables and you have a dataset where individuals are identified only by their ID numbers.

              You could also, of course, come up with a more complicated set of rules for creating the ID numbers from the inputs of the identifying variables. That's going to take more effort, though, and is arguably less secure. Either way you have to have a crosswalk in your possession (either the dataset with the IDs linked to names, or some code/documentation of the rules so that you can recreate them later), but if the IDs are arbitrary, a user cannot reverse engineer them to get back to an identifiable dataset.
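
              As a rough illustration of the crosswalk workflow described in this post, a minimal sketch might look like the following. The file names (survey1, survey2, crosswalk, survey2_masked) and the variable names (firstname, birthdate, thirdvar, id) are placeholders I made up, not names from the original data, and the IDs here simply follow the row order of the first dataset.

              ```stata
              * Sketch only: all file and variable names below are placeholders.

              * 1. Build the crosswalk from the first dataset, one row per person
              use firstname birthdate thirdvar using survey1, clear
              duplicates drop
              gen long id = _n
              save crosswalk, replace

              * 2. Attach the IDs to a later dataset, then drop the key variables
              use survey2, clear
              merge m:1 firstname birthdate thirdvar using crosswalk, assert(match using) keep(match) nogenerate
              drop firstname birthdate thirdvar
              save survey2_masked, replace
              ```

              With assert(match using), the merge stops with an error if survey2 contains anyone who is not in the crosswalk, so new respondents would have to be added to the crosswalk with fresh IDs first. The merge compares strings exactly, so differences in spelling or capitalization of a first name will show up as non-matches.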


              • #8
                I've usually used Sarah's approach - create the IDs randomly in a "master" list and then export a crosswalk file that can be used to code the others. If I have all the files in advance, I open each, keep the ID variables, append to a master list, code the master list using some random process, then use this to code each file.

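                * birthdate and thirdvar are placeholder names for the other key variables
                bysort firstname birthdate thirdvar: keep if _n == 1
                set seed 20151207
                * runiform() is the current name for uniform()
                gen double u = runiform()
                sort u
                gen long id_random = _n
                drop u
                save xwalk, replace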
                Then merge in the file -xwalk- to your other files to get the same random assignment.
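
                The master list that the block above starts from could be assembled along these lines when all the files are available in advance. This is only a sketch: survey1, survey2 and survey3 are hypothetical file names, and birthdate and thirdvar are placeholder names for the other identifying variables.

                ```stata
                * Sketch only: file and variable names are placeholders
                * Load just the identifying variables from the first file
                use firstname birthdate thirdvar using survey1, clear
                * Stack the same variables from the remaining files underneath
                append using survey2 survey3, keep(firstname birthdate thirdvar)
                ```

                The appended list still has one row for every appearance of a person across the files; the keep if _n==1 line in the block above is what reduces it to one row per person before the random IDs are assigned.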

                hth
                Jeph


                • #9
                  Great ideas! Many thanks!
                  Regards,
                  Laura Cojocaru
