  • Hashing a string

    Hi Everyone,

    Is there a way to hash a string using SHA256 or other similar algorithms which gives us unique IDs for the strings? Something like:
    Code:
     local x = sha256("I want to convert this text to an ID using SHA256")
    The output should be storable in a local and be used in .do and .ado programs.

    Thanks,
    Amit

  • #2
    Code:
    tempname x
    mata: st_numscalar("`x'", hash1("I want to convert this text to an ID using Jenkins's one-at-a-time hash function"))
    di %12.0g `x'
    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------


    • #3
      This functionality is not built into Stata; Jenkins's one-at-a-time hash is not the same as SHA256. The former is not cryptographic while the latter is. For a solution, see this post, which provides two methods: the first (from Bjarte Aagnes) uses Java; the second uses my shell wrapper inshell (available from SSC) and the shasum command-line utility, which ships with macOS and most Linux distributions.


      • #4
        Originally posted by Matthew Hall
        Jenkins's one-at-a-time hash is not the same as SHA256. The former is not cryptographic while the latter is.

        The original post asked for

        Originally posted by Amit Narnoli
        [...] SHA256 or other similar algorithms which gives us unique IDs for the strings.
        (emphasis mine)

        We cannot know what "similar" means in this context but I believe that hash1() is probably able to provide unique IDs; the need for Java (or Python) or any other workaround is not apparent. Still, it is nice to have different options available. I have added Python and will only point to hashlib. I do not currently have access to Python but it should not be harder to implement than using Java.
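For reference, a minimal sketch of the hashlib route mentioned above, in plain Python. It assumes a Python installation is available (for example through the Python integration that Stata has offered since version 16); the helper name `sha256_id` is made up for illustration and is not part of any Stata or Python API.

```python
import hashlib


def sha256_id(text: str) -> str:
    """Return the SHA-256 digest of `text` as a 64-character hex string.

    `sha256_id` is a hypothetical helper name; only hashlib.sha256 is real.
    """
    # hashlib operates on bytes, so the string is encoded as UTF-8 first.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(sha256_id("I want to convert this text to an ID using SHA256"))
```

The result is an ordinary 64-character string, so it should fit in a Stata local or a str64 variable once passed back.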


        • #5
          Kindly teach me how to use hash1() to "encrypt" unique IDs. Your advice is very much appreciated.
          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input str4 ID
          "7950"
          "3226"
          "6448"
          "8660"
          "9455"
          "2096"
          "2184"
          "2442"
          "3174"
          "5045"
          "1708"
          "7167"
          "8333"
          "7696"
          "5878"
          end


          • #6
            Originally posted by Diana Yoko
            Kindly teach me how to use hash1() to "encrypt" unique IDs.
            Wouldn't it be just a take-off of #2 above?

            .
            . version 18.0

            .
            . clear *

            .
            . quietly input str4 ID

            .
            . local line_size `c(linesize)'

            . set linesize 80

            .
            . mata:
            ------------------------------------------------- mata (type end to exit) ------
            : mata set matastrict on

            :
            : void function cvt(string scalar varname) {
            >         real scalar index
            >         index = st_addvar("double", "hashed_" + varname)
            >         st_varformat(index, "%10.0f")
            >
            >         real matrix Input
            >         pragma unset Input
            >         st_sview(Input, ., varname)
            >
            >         real scalar row
            >         for (row=1; row<=rows(Input); row++) {
            >                 st_store(row, index, hash1(Input[row, 1]))
            >         }
            > }

            :
            : end
            --------------------------------------------------------------------------------

            .
            . set linesize `line_size'

            .
            . // Illustrated:
            . mata: cvt("ID")

            .
            . list, noobs separator(0)

              +-------------------+
              |   ID    hashed_ID |
              |-------------------|
              | 7950    929759440 |
              | 3226      9335711 |
              | 6448   3934256883 |
              | 8660    453586097 |
              | 9455    197326183 |
              | 2096   1675437978 |
              | 2184   1265921504 |
              | 2442   1384281312 |
              | 3174    235598821 |
              | 5045    717063912 |
              | 1708    917571839 |
              | 7167   3326418111 |
              | 8333   2807370344 |
              | 7696    100363521 |
              | 5878    462755454 |
              +-------------------+

            . isid hashed_ID

            .
            . exit

            end of do-file


            .
            Code:
            mata:
            mata set matastrict on
            
            void function cvt(string scalar varname) {
                real scalar index
                index = st_addvar("double", "hashed_" + varname)
                st_varformat(index, "%10.0f")
            
                real matrix Input
                pragma unset Input
                st_sview(Input, ., varname)
            
                real scalar row
                for (row=1; row<=rows(Input); row++) {
                    st_store(row, index, hash1(Input[row, 1]))
                }
            }
            
            end


            • #7
              Many thanks, Joseph. However, I tried the code but nothing happens. Please advise.

              Code:
              clear
              input str4 ID
              "7950"
              "3226"
              "6448"
              "8660"
              "9455"
              "2096"
              "2184"
              "2442"
              "3174"
              "5045"
              "1708"
              "7167"
              "8333"
              "7696"
              "5878"
              end
              
              mata:
              mata set matastrict on
              void function cvt(string scalar varname) {
                  real scalar index
                  index = st_addvar("double", "hashed_" + varname)
                  st_varformat(index, "%10.0f")
                  real matrix Input
                  pragma unset Input
                  st_sview(Input, ., varname)
                  real scalar row
                  for (row=1; row<=rows(Input); row++) {
                      st_store(row, index, hash1(Input[row, 1]))
                  }
              }
              end


              • #8
                Perhaps a more streamlined approach would be to use R:

                Code:
                rcall: library(digest)
                rcall: digest("This is my string.", algo="sha256", serialize=FALSE, raw=TRUE)
                Output:
                e8 19 45 1e 85 bd f0 0e 9b c3 83 9c 5c 0b fc 17 5d d1 f9 39 ac 75 c4 ce 00 da 2e 31 e4 c9 84 48


                • #9
                  Originally posted by Diana Yoko
                  I tried the code but nothing happens. Please advise.
                  The code I posted just defined the function. So, after running the code that I showed, you still need to call the function from Stata, giving the name of the variable as its argument (surrounded by double-quotation marks).

                  Like this:
                  Code:
                  mata: cvt("ID")
                  Please examine the output that I posted; the function call is just after the comment "// Illustrated:".


                  • #10
                    Joseph Coveney It is very much appreciated. The code works well and fast for my actual data with around 1 million observations.


                    • #11
                      The code appears to work smoothly.

                      However, I could not get the uniqueness of the Hashed_ID in my actual database, with around 1 million distinct IDs (all with 9 digits number). I tried widening the format up to "%20.0f", but that did not resolve the problem. Please advise on any solution.

                      Code:
                      isid ID
                      
                      isid Hashed_ID
                      variable Hashed_ID does not uniquely identify the observations


                      • #12
                        To build on Joseph's example, it would appear that -hash1()- may not be suitable for generating unique hashes (at least in this setting), but I'm not an expert with this. The simulation below generates one million sequential 9-digit IDs and is sufficient to show collisions.

                        That said, if you already have a unique ID, which your code in #11 suggests, why do you need another, different unique ID?

                        Code:
                        clear *
                        set obs 1000000
                        gen ID = string(100000000 + _n - 1, "%21.0f")
                        
                        mata:
                        mata set matastrict on
                        
                        void function cvt(string scalar varname) {
                            real scalar index
                            index = st_addvar("str32", "hashed_" + varname)
                            st_varformat(index)
                        
                            real matrix Input
                            pragma unset Input
                            st_sview(Input, ., varname)
                        
                            real scalar row
                            for (row=1; row<=rows(Input); row++) {
                                st_sstore(row, index, strofreal(hash1(Input[row, 1]), "%21.0f") )
                            }
                        }
                        
                        end
                        
                        mata: cvt("ID")
                        duplicates report hashed_ID
                        duplicates tag hashed_ID, gen(dupe)
                        gsort -dupe hashed_ID ID
                        list in 1/20
                        Relevant result:

                        Code:
                        . duplicates report hashed_ID
                        
                        Duplicates in terms of hashed_ID
                        
                        --------------------------------------
                           Copies | Observations       Surplus
                        ----------+---------------------------
                                1 |       998890             0
                                2 |         1110           555
                        --------------------------------------


                        • #13
                          Originally posted by Diana Yoko
                          However, I could not get the uniqueness of the Hashed_ID in my actual database, with around 1 million distinct IDs (all with 9 digits number).
                          Yeah, I suspected that collisions would happen if your dataset were large enough, and that's why I sneaked the isid line in there.
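A rough way to see why collisions are expected at this scale is the birthday bound. The sketch below is plain Python, not Stata. It assumes that hash1() called without a second argument returns values in a 32-bit range (0 to 4,294,967,295) and that an ideal hash spreads inputs uniformly over that range; both are assumptions, and the function name is illustrative only.

```python
def expected_colliding_pairs(n_items: int, n_buckets: int) -> float:
    """Expected number of colliding pairs for an ideal uniform hash.

    Each of the n*(n-1)/2 distinct pairs collides with probability 1/n_buckets.
    """
    return n_items * (n_items - 1) / 2 / n_buckets


if __name__ == "__main__":
    # Assumption: hash1() has a 32-bit output range.
    print(round(expected_colliding_pairs(1_000_000, 2**32), 1))
    # For comparison: a 256-bit digest such as SHA-256.
    print(expected_colliding_pairs(1_000_000, 2**256))
```

Under those assumptions, roughly 116 colliding pairs would be expected among one million inputs. The 555 reported in #12 is therefore unsurprising in kind, although higher than the idealized figure. For a 256-bit digest the same calculation gives a vanishingly small number.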

                          As for solutions, you could do as asarray() does and create a duplicates column, or you could use some other hashing function. A couple of alternatives for the latter have been mentioned already upthread, but they are programmatically more involved.

                          Based upon your initial post ("to 'encrypt' unique IDs"), you seem to be interested primarily in disguising the ID values. The easiest way to accomplish that is through a so-called crosswalk table: run this code on a dataset that contains a complete list of your IDs.
                          Code:
                          set seed 517794135 // Recommended to aid in reproducibility (Use whatever value suits you)
                          
                          contract ID, freq(count)
                          
                          generate double randu = runiform()
                          isid randu, sort
                          
                          generate str nid = string(_n, "%07.0f") // nid = New ID
                          isid nid
                          drop count randu
                          Once saved, the dataset is then a crosswalk table that can be used via, say, merge, whenever you want to disguise or recover the original ID values. (Creation of a crosswalk table is probably what gave you your IDs in the first place.)


                          • #14
                            Leonardo Guizzetti, the purpose of "encrypting" the IDs, exactly as Joseph Coveney described, is to mask the original IDs, which cannot be given to a third party for confidentiality reasons. In this context, the masked IDs must be unique so that the original IDs can be looked up whenever necessary.

                            To my limited knowledge, hash functions have been used effectively in R (as discussed by Shen YANG in #8) and even in SQL to deal with similar issues. However, they have not been discussed much for Stata. My guess is that one reason is how conveniently Stata can generate a random (unique?) ID (with a seed), as illustrated in #13.

                            A weakness of the solution using random ordering and row numbers (i.e., _n) is that, if the original list of ID changes (i.e., having more, or dropping some IDs from the list), the corresponding "masked" ID will change. Note that a solution with hash(), provided it works properly and yields unique values, could avoid this weakness, since it directly "encrypts" the IDs themselves. Thus, a solution with hash() is still a desire, especially when confidentiality is a top priority, for which a consecutive new_ID by row number appears not attractive.

                            Could you advise whether any solution using hash() (or any type of encryption) that ensures uniqueness is feasible in Stata?

                            Thank you all for your comprehensive and valuable guidance.
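One hash-based approach that might fit this requirement is a keyed hash (HMAC) rather than a plain hash: the masked value depends only on the ID and a secret key, so it does not change when IDs are added to or dropped from the list. The sketch below is plain Python using the standard hmac and hashlib modules, not Stata code; the key, the helper name `mask_id`, and the 16-character truncation are all illustrative assumptions. A secret key matters here because, with only about a billion possible nine-digit IDs, an unkeyed hash could be reversed by hashing every candidate.

```python
import hashlib
import hmac


def mask_id(original_id: str, key: bytes, length: int = 16) -> str:
    """Return a keyed SHA-256 digest of `original_id`, cut to `length` hex chars.

    Hypothetical helper: the same (ID, key) pair always yields the same output,
    regardless of which other IDs are in the dataset.
    """
    digest = hmac.new(key, original_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


if __name__ == "__main__":
    secret = b"replace-with-a-long-random-secret"  # assumption: kept private
    ids = ["7950", "3226", "6448"]
    masked = {i: mask_id(i, secret) for i in ids}
    for original, new in masked.items():
        print(original, new)
    # Uniqueness is overwhelmingly likely but not guaranteed after truncation,
    # so it is still worth checking, as -isid- does in Stata.
    assert len(set(masked.values())) == len(ids)
```

Because the hash is one-way, referring back to the original IDs would still require keeping a crosswalk of original and masked values.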


                            • #15
                              Originally posted by Diana Yoko
                              . . . if the original list of ID changes (i.e., having more, or dropping some IDs from the list), the corresponding "masked" ID will change.
                              Not dropping, no. The New IDs won't change for those original ID values that remain in the list.

                              But if you anticipate adding IDs, then one solution is to exhaust all possibilities in advance. You mentioned earlier, "around 1 million distinct IDs (all with 9 digits number)". If the ID values are nine digits (numerical), then you could enumerate them, sort them randomly as above and assign new IDs to the rows. It's an ironclad solution, but with 999,999,999 possible ID values it'll take a while.

                              An alternative is to create a list of New IDs in a randomized sequence that is somewhat longer than the ultimate number of ID values that you anticipate. In this case you wouldn't sort the original ID values on the basis of their string values, but rather enter them into the crosswalk table in the order in which you first encounter them.
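This alternative can be sketched as follows, in Python for brevity rather than Stata. The class name, the seed, and the pool size are hypothetical choices for illustration: a pool of new IDs larger than the anticipated number of original IDs is shuffled once, and each original ID receives the next unused value the first time it is seen.

```python
import random


class Crosswalk:
    """Assign pre-shuffled new IDs to original IDs in order of first encounter."""

    def __init__(self, pool_size: int, seed: int) -> None:
        rng = random.Random(seed)              # seeded for reproducibility
        self._pool = list(range(1, pool_size + 1))
        rng.shuffle(self._pool)                # randomize the sequence once
        self._next = 0                         # index of the next unused new ID
        self._map = {}                         # original ID -> new ID
        self._width = len(str(pool_size))      # zero-pad to a fixed width

    def new_id(self, original_id: str) -> str:
        """Return the new ID for `original_id`, assigning one if needed."""
        if original_id not in self._map:
            if self._next >= len(self._pool):
                raise RuntimeError("pool of new IDs exhausted")
            value = self._pool[self._next]
            self._next += 1
            self._map[original_id] = str(value).zfill(self._width)
        return self._map[original_id]


if __name__ == "__main__":
    xwalk = Crosswalk(pool_size=99_999, seed=517794135)
    for original in ["7950", "3226", "7950"]:
        print(original, xwalk.new_id(original))
```

In this sketch, dropping original IDs never changes the assignments already made, and a newly added ID simply consumes the next value from the pool. The mapping would still have to be saved, as with the crosswalk table above.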

                              . . . a solution with hash() is still a desire, especially when confidentiality is a top priority, for which a consecutive new_ID by row number appears not attractive.
                              Wouldn't randomizing the row sequence render the original IDs unpredictable from the new IDs? Why is that unattractive here?
