  • #16
    Apologies for the confusion. My explanation in #13 of the "unattractiveness of the random solution" was apparently not clear. What I mean is that such a consecutive newid depends on the order of the IDs in the original list and is therefore not "stable".

    Below is an illustration of a simple case in which dropping a single original ID changes the newid (with the random method) but does not affect the hashed ID. That is why a hash() method (provided it can guarantee uniqueness) would have an "attractive" advantage over the random method.

    Can such a hash solution be found within Stata?
    Code:
    clear
    input str4 ID
    "7950"
    "3226"
    "6448"
    "8660"
    "9455"
    "2096"
    "2184"
    "2442"
    "3174"
    "5045"
    "1708"
    "7167"
    "8333"
    "7696"
    "5878"
    end
    
    mata:
    mata set matastrict on
    void function cvt(string scalar varname) {
        real scalar index
        index = st_addvar("double", "hashed_" + varname)
        st_varformat(index, "%10.0f")
        real matrix Input
        pragma unset Input
        st_sview(Input, ., varname)
        real scalar row
        for (row=1; row<=rows(Input); row++) {
            st_store(row, index, hash1(Input[row, 1]))
        }
    }
    end
    
    mata: cvt("ID")
    
    set seed 517794135
    generate double randu = runiform()
    generate str nid = string(_n, "%07.0f")
    
    drop if _n == 14
    ren (hashed_ID randu nid) =_1
    
    mata: cvt("ID")
    
    set seed 517794135
    generate double randu = runiform()
    generate str nid = string(_n, "%07.0f")
    
    assert hashed_ID_1 == hashed_ID
    
    assert nid_1 == nid
    assertion is false



    • #17
      Originally posted by Diana Yoko
      Below is an illustration of a simple case in which dropping a single original ID changes the newid (with the random method) . . .
      When you create the crosswalk table, you do it only once. You don't re-create it after dropping an ID. So, the nids don't change.

      Be aware that if you're not salting your hashes and if there are standards for the IDs (for example, if your nine-digit IDs are U.S. Social Security Numbers), then anyone can create a list of such IDs, hash them using the hash function, and merge the list of hashes against your dataset to obtain the original IDs.
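      As a minimal sketch of what salting could look like in Mata, assuming a secret string is simply prepended to each ID before hashing: the salt value and the variable name salted_hash below are placeholders made up for illustration, and hash1() is not a cryptographic hash, so this only raises the bar against the lookup attack described above.

      ```stata
      clear
      input str4 ID
      "7950"
      "3226"
      "6448"
      end

      mata:
      // placeholder salt (an assumption for this sketch); keep the real one secret
      salt = "replace-with-a-long-secret-string"
      ids  = st_sdata(., "ID")              // string column vector of the IDs
      h    = J(rows(ids), 1, .)
      for (i = 1; i <= rows(ids); i++) {
          h[i] = hash1(salt + ids[i])       // hash the salted ID, one row at a time
      }
      st_store(., st_addvar("double", "salted_hash"), h)
      end

      format salted_hash %10.0f
      duplicates report salted_hash         // check for collisions before relying on it
      ```

      Anyone who does not know the salt can no longer rebuild the same hashes from a list of candidate IDs, but the salt itself then needs the same protection as a crosswalk file.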

      Such reverse engineering isn't doable with a crosswalk table that has randomized assignment of the new IDs. (You neglected to randomize, by the way, in your example above: you need to sort on the random number before assigning new IDs to the rows.)
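      A minimal sketch of the missing step, reusing the variable names and seed from the example in #16 (the short ID list is only for illustration): the rows are sorted on the random number before the consecutive IDs are assigned.

      ```stata
      clear
      input str4 ID
      "7950"
      "3226"
      "6448"
      "8660"
      end

      set seed 517794135
      generate double randu = runiform()
      sort randu                                 // shuffle the rows first
      generate str7 nid = string(_n, "%07.0f")   // then number them consecutively
      drop randu
      isid nid                                   // each ID gets exactly one new ID
      list
      ```

      The resulting ID-to-nid table is the crosswalk; it is created once and merged back onto the data, rather than regenerated each time the ID list changes.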



      • #18
        Joseph Coveney

        (Many thanks for your suggestion on utilizing `salt' to increase confidentiality and pointing out the lack of randomizing in my example.)

        The data (around 1 million IDs) are published every year with minor changes to the ID list: some IDs are dropped, some are added. Thus it seems that the IDs need to be "re-masked" for each year's data. I am looking for a masked ID that stays the same across years for each ID, rather than its (randomized) rank in the data, which changes every year. That explains the desire for a method that masks the IDs directly. So far, the limited range of values returned by hash1() seems to restrict how well this method can guarantee uniqueness for a large number of IDs.
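        For a rough sense of scale, the birthday approximation can be evaluated directly. This is only a back-of-the-envelope sketch: it assumes hash1() is used with its default range of 2^32 values and that it behaves like a uniform random function, which is an idealization.

        ```stata
        local n = 1e6      // approximate number of IDs
        local m = 2^32     // assumed number of distinct hash1() values
        local pairs = `n' * (`n' - 1) / (2 * `m')
        display "expected colliding pairs:  " %6.1f `pairs'
        display "P(at least one collision): " %8.6f (1 - exp(-`pairs'))
        ```

        Under these assumptions one would expect on the order of a hundred colliding pairs among a million IDs, so at least one collision is practically certain.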




        • #19
          Diana, if the ID structure remains the same - always 9 digits - and the IDs for an individual are consistent if they appear from year to year, then Joseph’s method is consistent and fully reproducible (including some added salt).

          You haven’t said one way or another, but if you are exporting data from an actual database (e.g., SQL) you may consider adding the hash in that system prior to export.



          • #20
            Code:
            //-------------------------------------------------------------------------
            // create some example data
            clear all
            frames reset
            frame create y2023
            frame create y2024
            
            frame change y2023
            // store id as long: input's default type (float) cannot hold 9-digit integers exactly
            input long id y x
            147258369 1 3
            123456789 2 4
            end
            format id %12.0g
            
            frame change y2024
            input long id y x
            987654321 9 7
            147258369 6 4
            end
            format id %12.0g
            
            /*
            So:
            - person 147258369 appears both years
            - person 123456789 appears only in 2023
            - person 987654321 appears only in 2024
            
            ---------------------------------------------------------------------------
            next step:
            we assume we are in 2023 and make our crosswalk file
            
            To ensure confidentiality we want the sort order to be as unpredictable
            as possible. To do that we need to set the seed once, to an unpredictable number
            
            I picked a random bill from my wallet and took the last 6 numbers of its
            registration number
            */
            set seed 941599
            
            frame copy y2023 crosswalk
            frame change crosswalk
            keep id
            gen double u1 = runiform()
            gen double u2 = runiform()
            sort u1 u2
            gen newid = _n
            drop u1 u2
            
            /*
            This file is very important:
            - you need to make sure it does not get lost
            - you need to make sure it does not get in the wrong hands
            
            We are still in 2023 and now prepare the public use file
            */
            
            frame change y2023
            frlink 1:1 id , frame(crosswalk)
            frget newid, from(crosswalk)
            drop id crosswalk
            list
            
            /*
            ---------------------------------------------------------------------------
            Now we are in 2024, and new data has become available
            Some people dropped out of the study and there was a refresher sample, so
            new people came into the sample.
            That is not a problem
            */
            
            frame change y2024
            frlink 1:1 id, frame(crosswalk)
            frget newid, from(crosswalk)
            list
            
            /*
            OK, so we accurately recovered the newid from 2023 if it existed
            Now we need to make an additional id for the new person
            
            we need fframeappend for that, which we can get by typing:
            ssc install fframeappend
            */
            
            frame change crosswalk
            fframeappend id if newid == . , using(y2024)
            list
            
            /*
            So now we add the new person to our crosswalk file, and need to assign
            it a newid while keeping the old ones intact
            So we sort by the newid (missing values will appear last) and random
            numbers, so we keep the sort order of the old ones and the new ones will be random at the end
            
            A year has passed, so we need to set the seed again using the same trick
            to remain unpredictable
            */
            set seed 777362
            gen double u1 = runiform()
            gen double u2 = runiform()
            sort newid u1 u2
            list
            replace newid = _n
            list
            drop u1 u2
            
            /*
            This is now the updated crosswalk file.
            The newid does now contain some identifying information: large numbers
            are more likely to be a person from a refresher sample
            But the id does not contain any other information
            (as long as you keep the crosswalk file safe)
            
            Now we can make our public use file for year 2024
            */
            
            frame change y2024
            frlink rebuild crosswalk
            drop newid
            frget newid,from(crosswalk)
            drop crosswalk
            list
            ---------------------------------
            Maarten L. Buis
            University of Konstanz
            Department of history and sociology
            box 40
            78457 Konstanz
            Germany
            http://www.maartenbuis.nl
            ---------------------------------



            • #21
              Maarten Buis, your comprehensive code and guidance are admirable and much appreciated. Your idea now seems novel and clear to me: use randomization to create a masked ID, keep the crosswalk safe, then deal only with the newcomers in later years. I will try it on my actual data and share feedback here soon. Many thanks.
