Hashing a string - Statalist

Diana Yoko

Join Date: Aug 2019
Posts: 26

#16

05 Jul 2023, 05:08

Apologies for the confusion. It seems my explanation in #13 was not clear regarding the "unattractiveness of the random solution". What I mean is that such a consecutive newid depends on the order of the original ID list and is thus not "stable".

Below is an illustration of a simple case in which dropping a single original ID changes the newid (with the random method) but does not affect the hashed_id. That explains why a hash() method (if it could guarantee uniqueness) would have an "attractive" advantage over the random method.

Can such a hash solution be found within Stata?

Code:

clear
input str4 ID
"7950"
"3226"
"6448"
"8660"
"9455"
"2096"
"2184"
"2442"
"3174"
"5045"
"1708"
"7167"
"8333"
"7696"
"5878"
end

mata:
mata set matastrict on
void function cvt(string scalar varname) {
    real scalar index
    index = st_addvar("double", "hashed_" + varname)
    st_varformat(index, "%10.0f")
    string matrix Input    // st_sview() creates a view onto a string variable
    pragma unset Input
    st_sview(Input, ., varname)
    real scalar row
    for (row=1; row<=rows(Input); row++) {
        st_store(row, index, hash1(Input[row, 1]))
    }
}
end

mata: cvt("ID")

set seed 517794135
generate double randu = runiform()
generate str nid = string(_n, "%07.0f")

drop if _n == 14
ren (hashed_ID randu nid) =_1

mata: cvt("ID")

set seed 517794135
generate double randu = runiform()
generate str nid = string(_n, "%07.0f")

assert hashed_ID_1 == hashed_ID

assert nid_1 == nid
assertion is false


Joseph Coveney

Join Date: Apr 2014

Posts: 4374
#17

05 Jul 2023, 06:36

Originally posted by Diana Yoko

Below is an illustration of a simple case in which dropping a single original ID changes the newid (with the random method) . . .

When you create the crosswalk table, you do it only once. You don't re-create it after dropping an ID. So, the nids don't change.

Be aware that if you're not salting your hashes and if there are standards for the IDs (for example, if your nine-digit IDs are U.S. Social Security Numbers), then anyone can create a list of such IDs, hash them using the hash function, and merge the list of hashes against your dataset to obtain the original IDs.

Such reverse engineering isn't doable with a crosswalk table that has randomized assignment of the new IDs. (You neglected to randomize, by the way, in your example above: you need to sort on the random number before assigning new IDs to the rows.)
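
To make the point about unsalted hashes concrete, here is a minimal sketch, written in Python rather than Stata because Mata's hash1() is not a keyed or cryptographic hash. Everything in it is an illustrative assumption, not something posted in this thread: the use of SHA-256 and HMAC, the names plain_hash, keyed_hash and SECRET, and the four-digit ID format borrowed from the example in #16. It enumerates every possible four-digit ID, hashes each one, and matches the results against the "published" hashes.

```python
import hashlib
import hmac


def plain_hash(id_str: str) -> str:
    """Unsalted SHA-256 of the ID: anyone can recompute it."""
    return hashlib.sha256(id_str.encode("utf-8")).hexdigest()


def keyed_hash(id_str: str, secret: bytes) -> str:
    """HMAC-SHA256 of the ID under a secret key (plays the role of a secret salt)."""
    return hmac.new(secret, id_str.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical secret; it would have to be guarded as carefully as a crosswalk file.
SECRET = b"hypothetical-secret-not-for-publication"

original_ids = ["7950", "3226", "6448"]
published_plain = {plain_hash(i) for i in original_ids}
published_keyed = {keyed_hash(i, SECRET) for i in original_ids}

# The attacker knows the ID format (four digits) but not the secret.
candidates = [f"{n:04d}" for n in range(10_000)]
recovered_plain = sorted(c for c in candidates if plain_hash(c) in published_plain)
recovered_keyed = sorted(c for c in candidates if plain_hash(c) in published_keyed)

print(recovered_plain)  # every original ID is recovered from the unsalted hashes
print(recovered_keyed)  # nothing matches without the secret
```

The keyed hash of a given ID does not depend on which other IDs are in the file, so it stays the same from year to year. The protection rests entirely on the secret: anyone who obtains it can repeat the enumeration.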

Diana Yoko

Join Date: Aug 2019

Posts: 26
#18

05 Jul 2023, 16:26

Joseph Coveney

(Many thanks for your suggestion to use `salt' to increase confidentiality, and for pointing out the lack of randomization in my example.)

The data (around 1 million IDs) are published yearly with minor changes in the ID list: some IDs are dropped, some are added. Thus, it seems that the IDs need to be "re-masked" for each year's data. I am looking for a masked_ID that stays consistent across years for each ID, rather than its (randomized) rank in the data, which changes every year. That explains the desire for a method that masks IDs directly. So far, the limited range of values returned by hash1() seems to prevent the method from producing unique values for a large number of IDs.
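
The limitation mentioned above can be put in rough numbers with a birthday-problem estimate. The sketch below is in Python and rests on two stated assumptions: that hash1() called without a range argument returns a 32-bit value (0 to 4,294,967,295), as I understand its documentation, and that hash values behave as if uniformly distributed. The function names are hypothetical.

```python
import math


def expected_colliding_pairs(n_ids: int, n_bits: int) -> float:
    """Expected number of ID pairs sharing a hash value, assuming uniform hashes."""
    return n_ids * (n_ids - 1) / 2 / 2**n_bits


def prob_any_collision(n_ids: int, n_bits: int) -> float:
    """Poisson approximation to the probability of at least one collision."""
    return -math.expm1(-expected_colliding_pairs(n_ids, n_bits))


n_ids = 1_000_000  # about the number of IDs mentioned above
for n_bits in (32, 64, 256):
    pairs = expected_colliding_pairs(n_ids, n_bits)
    prob = prob_any_collision(n_ids, n_bits)
    print(f"{n_bits:3d}-bit hash: expected colliding pairs = {pairs:.3g}, "
          f"P(any collision) = {prob:.3g}")
```

Under these assumptions, a 32-bit hash of a million IDs is expected to produce on the order of a hundred colliding pairs, so uniqueness cannot be relied on, whereas a 64-bit or wider hash makes a collision very unlikely at this scale.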

Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2389
#19

05 Jul 2023, 16:44

Diana, if the ID structure remains the same - always 9 digits - and the IDs for an individual are consistent if they appear from year to year, then Joseph’s method is consistent and fully reproducible (including some added salt).

You haven’t said one way or another, but if you are exporting data from an actual database (e.g., SQL) you may consider adding the hash in that system prior to export.

Maarten Buis

Join Date: Mar 2014
Posts: 3426

#20

06 Jul 2023, 02:12

Code:

//-------------------------------------------------------------------------
// create some example data
clear all
frames reset
frame create y2023
frame create y2024

frame change y2023
// store id as long: the default float type cannot hold 9-digit IDs exactly
input long id y x
147258369 1 3
123456789 2 4
end
format id %12.0g

frame change y2024
input long id y x
987654321 9 7
147258369 6 4
end
format id %12.0g

/*
So:
- person 147258369 appears both years
- person 123456789 appears only in 2023
- person 987654321 appears only in 2024

---------------------------------------------------------------------------
next step:
we assume we are in 2023 and make our crosswalk file

To ensure confidentiality we want the sort order to be as unpredictable
as possible. To do that we need to set the seed once, to an unpredictable number

I picked a random bill from my wallet and took the last 6 numbers of its
registration number
*/
set seed 941599

frame copy y2023 crosswalk
frame change crosswalk
keep id
gen double u1 = runiform()
gen double u2 = runiform()
sort u1 u2
gen newid = _n
drop u1 u2

/*
This file is very important:
- you need to make sure it does not get lost
- you need to make sure it does not get in the wrong hands

We are still in 2023 and now prepare the public use file
*/

frame change y2023
frlink 1:1 id , frame(crosswalk)
frget newid, from(crosswalk)
drop id crosswalk
list

/*
---------------------------------------------------------------------------
Now we are in 2024, and new data has become available
Some people dropped out of the study and there was a refresher sample, so
new people came into the sample.
That is not a problem
*/

frame change y2024
frlink 1:1 id, frame(crosswalk)
frget newid, from(crosswalk)
list

/*
OK, so we accurately recovered the newid from 2023 if it existed
Now we need to make an additional id for the new person

we need fframeappend for that, which we can get by typing:
ssc install fframeappend
*/

frame change crosswalk
fframeappend id if newid == . , using(y2024)
list

/*
So now we add the new person to our crosswalk file, and need to assign
it a newid while keeping the old ones intact
So we sort by the newid (missing values will appear last) and random
numbers, so we keep the sort order of the old ones and the new ones will
be random at the end

A year has passed, so we need to set the seed again using the same trick
to remain unpredictable
*/
set seed 777362
gen double u1 = runiform()
gen double u2 = runiform()
sort newid u1 u2
list
replace newid = _n
list
drop u1 u2

/*
This is now the updated crosswalk file.
The newid does now contain some identifying information: large numbers
are more likely to be a person from a refresher sample
But the id does not contain any other information
(as long as you keep the crosswalk file safe)

Now we can make our public use file for year 2024
*/

frame change y2024
frlink rebuild crosswalk
drop newid
frget newid,from(crosswalk)
drop id crosswalk
list

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------


Diana Yoko

Join Date: Aug 2019

Posts: 26
#21

06 Jul 2023, 12:47

Maarten Buis, your comprehensive code and guidance are much appreciated. Your idea is now clear to me, and novel: use randomization to create a masked ID, keep the crosswalk safe, then deal only with newcomers in later years. I will try it on my actual data and share feedback here soon. Many thanks.