Hashing a string

Amit Narnoli

Join Date: Jul 2016

Posts: 36
#1

29 May 2023, 23:57

Hi Everyone,

Is there a way to hash a string using SHA256 or other similar algorithms which gives us unique IDs for the strings? Something like:

Code:

local x = sha256("I want to convert this text to an ID using SHA256")

The output should be storable in a local and be used in .do and .ado programs.

Thanks,
Amit
Maarten Buis

Join Date: Mar 2014

Posts: 3407
#2

30 May 2023, 03:08

Code:

tempname x
mata: st_numscalar("`x'", hash1("I want to convert this text to an ID using Jenkins's one-at-a-time hash function"))
di %12.0g `x'

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Matthew Hall

Join Date: Jun 2022

Posts: 29
#3

30 May 2023, 10:40

This functionality is not built into Stata; Jenkins's one-at-a-time hash is not the same as SHA256. The former is not cryptographic while the latter is. For a solution, see this post, which provides two methods: the first (from Bjarte Aagnes) uses Java; the second uses my shell wrapper inshell (available on the SSC) and the shasum command-line utility, which is included on all Mac and Linux systems.
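To make the distinction concrete, here is a minimal Python sketch of Jenkins's one-at-a-time hash as it is commonly published. The function name `jenkins_one_at_a_time` is made up for illustration, and the sketch is not claimed to reproduce the exact values that Mata's hash1() returns (its byte handling and output range may differ). The point is only that the result is a 32-bit integer, which cannot offer the collision resistance of a 256-bit cryptographic digest.

```python
def jenkins_one_at_a_time(data: bytes) -> int:
    """Return the 32-bit Jenkins one-at-a-time hash of a byte string."""
    h = 0
    for byte in data:
        # Mix each input byte into the running 32-bit state.
        h = (h + byte) & 0xFFFFFFFF
        h = (h + (h << 10)) & 0xFFFFFFFF
        h ^= h >> 6
    # Final avalanche steps.
    h = (h + (h << 3)) & 0xFFFFFFFF
    h ^= h >> 11
    h = (h + (h << 15)) & 0xFFFFFFFF
    return h


if __name__ == "__main__":
    text = "I want to convert this text to an ID"
    print(jenkins_one_at_a_time(text.encode("utf-8")))
```

Whatever the input length, the output always fits in 32 bits, so at most about 4.29 billion distinct values are possible.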
daniel klein

Join Date: Mar 2014

Posts: 3805
#4

30 May 2023, 11:47

Originally posted by Matthew Hall

Jenkins's one-at-a-time hash is not the same as SHA256. The former is not cryptographic while the latter is.

The original post asked for

Originally posted by Amit Narnoli

[...] SHA256 or other similar algorithms which gives us unique IDs for the strings.

(emphasis mine)

We cannot know what "similar" means in this context, but I believe that hash1() is probably able to provide unique IDs; the need for Java (or Python) or any other workaround is not apparent. Still, it is nice to have different options available. I will add Python to the list and only point to hashlib. I do not currently have access to Python, but it should not be harder to implement than using Java.
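For readers who want to follow the hashlib pointer, a minimal sketch in plain Python might look like the following. The helper name `sha256_hex` is hypothetical, and how the result is passed back into a Stata local is not shown here.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of a string as 64 hexadecimal characters."""
    # hashlib works on bytes, so the string is encoded as UTF-8 first.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(sha256_hex("I want to convert this text to an ID using SHA256"))
```

The digest is text, not a number, so it would have to be stored in a string variable or a local macro rather than in a numeric scalar.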

Diana Yoko

Join Date: Aug 2019
Posts: 26
#5

03 Jul 2023, 21:14

Kindly teach me how to use hash1() to "encrypt" unique IDs. Your advice is very much appreciated.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str4 ID
"7950"
"3226"
"6448"
"8660"
"9455"
"2096"
"2184"
"2442"
"3174"
"5045"
"1708"
"7167"
"8333"
"7696"
"5878"
end


Joseph Coveney

Join Date: Apr 2014

Posts: 4352
#6

03 Jul 2023, 23:00

Originally posted by Diana Yoko

Kindly teach me how to use hash1() to "encrypt" unique IDs.

Wouldn't it be just a take-off of #2 above?

. 
. version 18.0

. 
. clear *

. 
. quietly input str4 ID

. 
. local line_size `c(linesize)'

. set linesize 80

. 
. mata:
------------------------------------------------- mata (type end to exit) ------
: mata set matastrict on

: 
: void function cvt(string scalar varname) {
>         real scalar index
>         index = st_addvar("double", "hashed_" + varname)
>         st_varformat(index, "%10.0f")
> 
>         real matrix Input
>         pragma unset Input
>         st_sview(Input, ., varname)
> 
>         real scalar row
>         for (row=1; row<=rows(Input); row++) {
>                 st_store(row, index, hash1(Input[row, 1]))
>         }
> }

: 
: end
--------------------------------------------------------------------------------

. 
. set linesize `line_size'

. 
. // Illustrated:
. mata: cvt("ID")

. 
. list, noobs separator(0)

  +-------------------+
  |   ID    hashed_ID |
  |-------------------|
  | 7950    929759440 |
  | 3226      9335711 |
  | 6448   3934256883 |
  | 8660    453586097 |
  | 9455    197326183 |
  | 2096   1675437978 |
  | 2184   1265921504 |
  | 2442   1384281312 |
  | 3174    235598821 |
  | 5045    717063912 |
  | 1708    917571839 |
  | 7167   3326418111 |
  | 8333   2807370344 |
  | 7696    100363521 |
  | 5878    462755454 |
  +-------------------+

. isid hashed_ID

. 
. exit

end of do-file

.

Code:

mata:
mata set matastrict on
void function cvt(string scalar varname) {
    real scalar index
    index = st_addvar("double", "hashed_" + varname)
    st_varformat(index, "%10.0f")

    real matrix Input
    pragma unset Input
    st_sview(Input, ., varname)

    real scalar row
    for (row=1; row<=rows(Input); row++) {
        st_store(row, index, hash1(Input[row, 1]))
    }
}
end

Diana Yoko

Join Date: Aug 2019
Posts: 26
#7

04 Jul 2023, 00:43

Many thanks, Joseph. However, I tried the code but nothing happened. Please instruct me.

Code:

clear
input str4 ID
"7950"
"3226"
"6448"
"8660"
"9455"
"2096"
"2184"
"2442"
"3174"
"5045"
"1708"
"7167"
"8333"
"7696"
"5878"
end

mata:
mata set matastrict on
void function cvt(string scalar varname) {
    real scalar index
    index = st_addvar("double", "hashed_" + varname)
    st_varformat(index, "%10.0f")
    real matrix Input
    pragma unset Input
    st_sview(Input, ., varname)
    real scalar row
    for (row=1; row<=rows(Input); row++) {
        st_store(row, index, hash1(Input[row, 1]))
    }
}
end


Shen YANG

Join Date: Apr 2023

Posts: 41
#8

04 Jul 2023, 01:16

Perhaps a more streamlined approach would be to use R.

Code:

rcall: library(digest)
rcall: digest("This is my string.", algo="sha256", serialize=FALSE, raw=TRUE)

Output:
e8 19 45 1e 85 bd f0 0e 9b c3 83 9c 5c 0b fc 17 5d d1 f9 39 ac 75 c4 ce 00 da 2e 31 e4 c9 84 48
Joseph Coveney

Join Date: Apr 2014

Posts: 4352
#9

04 Jul 2023, 02:25

Originally posted by Diana Yoko

I tried the code but nothing happened. Please instruct me.

The code I posted just defined the function. So, after running the code that I showed, you still need to call the function from Stata, giving the name of the variable as its argument (surrounded by double-quotation marks).

Like this:

Code:

mata: cvt("ID")

Please examine the output that I posted; the function call is just after the comment "// Illustrated:".
Diana Yoko

Join Date: Aug 2019

Posts: 26
#10

04 Jul 2023, 06:46

Joseph Coveney, it is very much appreciated. The code works well and fast on my actual data, which has around 1 million observations.
Diana Yoko

Join Date: Aug 2019

Posts: 26
#11

04 Jul 2023, 07:47

The code appears to work smoothly.

However, I could not get the uniqueness of the Hashed_ID in my actual database, with around 1 million distinct IDs (all with 9 digits number). I have tried widening the format up to "%20.0f", but that did not resolve the problem. Please advise on any solution.

Code:

isid ID
isid Hashed_ID
variable Hashed_ID does not uniquely identify the observations

Leonardo Guizzetti

Join Date: Jul 2016
Posts: 2371

#12

04 Jul 2023, 08:43

To build on Joseph's example, it would appear that -hash1()- may not be suitable for generating unique hashes (at least in this setting), but I'm no expert on this. The simulation below generates one million sequential 9-digit IDs, which is sufficient to show collisions.

That said, if you already have a unique ID, which your code in #11 suggests, why do you need another, different unique ID?

Code:

clear *
set obs 1000000
gen ID = string(100000000 + _n - 1, "%21.0f")

mata:
mata set matastrict on

void function cvt(string scalar varname) {
    real scalar index
    index = st_addvar("str32", "hashed_" + varname)
    st_varformat(index)

    real matrix Input
    pragma unset Input
    st_sview(Input, ., varname)

    real scalar row
    for (row=1; row<=rows(Input); row++) {
        st_sstore(row, index, strofreal(hash1(Input[row, 1]), "%21.0f") )
    }
}

end

mata: cvt("ID")
duplicates report hashed_ID
duplicates tag hashed_ID, gen(dupe)
gsort -dupe hashed_ID ID
list in 1/20

Relevant result:

Code:

. duplicates report hashed_ID

Duplicates in terms of hashed_ID

--------------------------------------
   Copies | Observations       Surplus
----------+---------------------------
        1 |       998890             0
        2 |         1110           555
--------------------------------------
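
Collisions at this scale are expected rather than surprising. The sketch below estimates them for an idealized hash with the standard birthday approximation. It assumes that hash1() returns 32-bit values by default, and the function names are hypothetical.

```python
import math


def expected_colliding_pairs(n_keys: int, hash_bits: int) -> float:
    """Expected number of colliding pairs under an ideal uniform hash."""
    return n_keys * (n_keys - 1) / 2 / 2 ** hash_bits


def prob_any_collision(n_keys: int, hash_bits: int) -> float:
    """Approximate probability of at least one collision (birthday bound)."""
    # 1 - exp(-x), computed via expm1 so tiny probabilities are not rounded to 0.
    return -math.expm1(-expected_colliding_pairs(n_keys, hash_bits))


if __name__ == "__main__":
    for bits in (32, 256):
        pairs = expected_colliding_pairs(1_000_000, bits)
        print(bits, pairs, prob_any_collision(1_000_000, bits))
```

For one million keys, an ideal 32-bit hash would be expected to produce roughly 116 colliding pairs, so at least one collision is practically certain. The 555 surplus observations reported above are more than that idealized figure. With a 256-bit digest the expected number is vanishingly small.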


Joseph Coveney

Join Date: Apr 2014

Posts: 4352
#13

04 Jul 2023, 19:02

Originally posted by Diana Yoko

However, I could not get the uniqueness of the Hashed_ID in my actual database, with around 1 million distinct IDs (all with 9 digits number).

Yeah, I suspected that collisions would happen if your dataset were large enough, and that's why I sneaked the isid line in there.

As for solutions, you could do as asarray() does and create a duplicates column, or you could use some other hashing function. A couple of alternatives for the latter have been mentioned already upthread, but they are programmatically more involved.

Based upon your initial post ("to 'encrypt' unique IDs"), you seem to be interested primarily in disguising the ID values. The easiest way to accomplish that is through a so-called crosswalk table: run this code on a dataset that contains a complete list of your IDs.

Code:

set seed 517794135 // Recommended to aid in reproducibility (Use whatever value suits you)
contract ID, freq(count)
generate double randu = runiform()
isid randu, sort
generate str nid = string(_n, "%07.0f") // nid = New ID
isid nid
drop count randu

Once saved, the dataset is then a crosswalk table that can be used via, say, merge, whenever you want to disguise or recover the original ID values. (Creation of a crosswalk table is probably what gave you your IDs in the first place.)
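
For comparison, the same crosswalk idea can be sketched in Python. This is only an illustration under stated assumptions: `build_crosswalk` is a hypothetical name, and Python's random number generator differs from Stata's, so the mapping will not match what the Stata code above produces even with the same seed.

```python
import random


def build_crosswalk(ids, seed=517794135, width=7):
    """Map each distinct original ID to a zero-padded new ID in random order."""
    unique_ids = sorted(set(ids))   # one row per distinct ID, like -contract ID-
    rng = random.Random(seed)       # fixed seed so the table can be reproduced
    rng.shuffle(unique_ids)         # random order, like sorting on a runiform() draw
    # New ID is the position in the shuffled list, formatted like "%07.0f".
    return {orig: f"{pos:0{width}d}" for pos, orig in enumerate(unique_ids, start=1)}


if __name__ == "__main__":
    crosswalk = build_crosswalk(["7950", "3226", "6448", "8660", "7950"])
    for original, new in crosswalk.items():
        print(original, new)
```

The dictionary plays the role of the saved crosswalk dataset: looking up an original ID gives its new ID, and inverting the dictionary recovers the original.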
Diana Yoko

Join Date: Aug 2019

Posts: 26
#14

04 Jul 2023, 19:55

Leonardo Guizzetti, the need for "encrypting" IDs, exactly as mentioned by Joseph Coveney, is to "mask" the original IDs, which cannot be provided to a third party for confidentiality reasons. In this context, unique "masked" IDs are needed so that the original IDs can be referred to whenever necessary.

To my limited knowledge, the hash() function has been used effectively in R (as discussed by Shen YANG in #8) and even in SQL to deal with similar issues. However, it has not been discussed much for Stata. I guess one reason could be Stata's capacity to conveniently generate random (unique?) IDs (with a seed), as illustrated in #13.

A weakness of the solution that uses random numbers and refers to the row number (i.e., _n) is that, if the original list of ID changes (i.e., having more, or dropping some IDs from the list), the corresponding "masked" ID will change. Note that a solution with hash(), if it works properly to provide uniqueness, could avoid this weakness since it directly "encrypts" the corresponding IDs themselves. Thus, a solution with hash() is still a desire, especially when confidentiality is a top priority, for which a consecutive new_ID by row number appears not attractive.
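
The property described above, a masked ID that depends only on the original ID and not on the rest of the list, can be sketched in Python with a keyed hash. This swaps in a technique that nobody in the thread proposed (HMAC with SHA-256 from the standard library). The reason is that a plain hash of a 9-digit ID could be reversed by simply hashing every possible 9-digit value. The name `mask_id` and the example key are hypothetical.

```python
import hashlib
import hmac


def mask_id(original_id: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 digest (HMAC) of an ID as hexadecimal text."""
    return hmac.new(secret_key, original_id.encode("utf-8"), hashlib.sha256).hexdigest()


if __name__ == "__main__":
    key = b"replace-with-a-private-key"  # placeholder; keep the real key confidential
    for original in ["7950", "3226", "6448"]:
        print(original, mask_id(original, key))
```

The same ID and key always give the same masked value, whatever else is in the list. Getting back to an original ID still requires a stored lookup table, or the key plus a pass over the candidate IDs.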

Please advise whether any solution with hash() (or any type of encryption) that guarantees uniqueness is feasible in Stata.

Thank you all for your comprehensive and valuable guidance.
Joseph Coveney

Join Date: Apr 2014

Posts: 4352
#15

04 Jul 2023, 22:34

Originally posted by Diana Yoko

. . . if the original list of ID changes (i.e., having more, or dropping some IDs from the list), the corresponding "masked" ID will change.

Not dropping, no. The New IDs won't change for those original ID values that remain in the list.

But if you anticipate adding IDs, then one solution is to exhaust all possibilities in advance. You mentioned earlier, "around 1 million distinct IDs (all with 9 digits number)". If the ID values are nine digits (numerical), then you could enumerate them, sort them randomly as above and assign new IDs to the rows. It's an ironclad solution, but with 999,999,999 possible ID values it'll take a while.

An alternative is to create a list of New IDs that is a randomized sequence and is somewhat longer than the ultimate number of ID values that you anticipate. In this case you wouldn't sort the original ID values on the basis of their string values, but rather enter them into the crosswalk table in the order in which you first encounter them.
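
A rough Python sketch of this alternative, under stated assumptions: the class name `MaskedIdPool`, its methods, and the capacity and seed values are all hypothetical, and the crosswalk is held in memory here rather than saved as a dataset.

```python
import random


class MaskedIdPool:
    """Hand out pre-shuffled new IDs in order of first encounter."""

    def __init__(self, capacity: int, seed: int, width: int = 7):
        # Pre-generate more new IDs than are ever expected to be needed,
        # in a random order fixed by the seed.
        numbers = list(range(1, capacity + 1))
        random.Random(seed).shuffle(numbers)
        self._unused = [f"{n:0{width}d}" for n in numbers]
        self._crosswalk = {}

    def lookup(self, original_id: str) -> str:
        """Return the new ID for an original ID, assigning one on first sight."""
        if original_id not in self._crosswalk:
            if not self._unused:
                raise RuntimeError("pool exhausted; capacity was too small")
            self._crosswalk[original_id] = self._unused.pop()
        return self._crosswalk[original_id]


if __name__ == "__main__":
    pool = MaskedIdPool(capacity=2_000_000, seed=517794135)
    for original in ["7950", "3226", "7950"]:
        print(original, pool.lookup(original))
```

Because an ID keeps whatever new ID it was first given, adding IDs later does not change the assignments already made.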

. . . a solution with hash() is still a desire, especially when confidentiality is a top priority, for which a consecutive new_ID by row number appears not attractive.

Wouldn't randomizing the row sequence render the original IDs unpredictable from the new IDs? Why is that unattractive here?