Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Generating a unique id from string

    Hi everyone,

    I have a panel data from 2000-2020 at the factory level. The factory ID is a string variable and I am trying to create a numeric id for the factory.

    Here is the code that I tried:
    Code:
    bysort FACTORYId : gen id = _N
    And this is how it looks:
    Code:
     * Example generated by -dataex-. For more info, type help dataex clear input str50 FACTORYId float id "0000006F" 1 "0000008B" 2 "0000008B" 2 "0000022F" 1 "0000023F" 1 "0000033F" 5 "0000033F" 5 "0000033F" 5 "0000033F" 5 "0000033F" 5 "0000103F" 1 "0000108B" 3 "0000108B" 3 "0000108B" 3 "0000123F" 2 "0000123F" 2 "0000132F" 2 "0000132F" 2 "0000133F" 4 "0000133F" 4 "0000133F" 4 "0000133F" 4 "0000206F" 5 "0000206F" 5 "0000206F" 5 "0000206F" 5 "0000206F" 5 "0000207F" 2 "0000207F" 2 "0000208B" 6 "0000208B" 6 "0000208B" 6 "0000208B" 6 "0000208B" 6 "0000208B" 6 "0000209F" 2 "0000209F" 2 "0000219F" 2 "0000219F" 2 "0000222F" 1 "0000306F" 1 "0000308B" 3 "0000308B" 3 "0000308B" 3 "0000310F" 3 "0000310F" 3 "0000310F" 3 "0000322F" 1 "0000323F" 1 "0000332F" 1 "0000333F" 3 "0000333F" 3 "0000333F" 3 "0000406F" 1 "0000407F" 2 "0000407F" 2 "0000408B" 5 "0000408B" 5 "0000408B" 5 "0000408B" 5 "0000408B" 5 "0000420F" 2 "0000420F" 2 "0000522F" 3 "0000522F" 3 "0000522F" 3 "0000523F" 1 "0000532F" 1 "0000607F" 1 "0000622F" 1 "0000623F" 4 "0000623F" 4 "0000623F" 4 "0000623F" 4 "0000706F" 1 "0000708B" 8 "0000708B" 8 "0000708B" 8 "0000708B" 8 "0000708B" 8 "0000708B" 8 "0000708B" 8 "0000708B" 8 "0000723F" 1 "0000733F" 2 "0000733F" 2 "0000803F" 1 "0000806F" 4 "0000806F" 4 "0000806F" 4 "0000806F" 4 "0000807F" 1 "0000820F" 2 "0000820F" 2 "0000823F" 5 "0000823F" 5 "0000823F" 5 "0000823F" 5 "0000823F" 5 "0000832F" 1 end
    For some reason, this is not generating unique ids. I thought maybe the issue is it's generating a float variable but even after I try gen long, the problem remains.

  • #2
    this code:

    Code:
    bysort FACTORYId : gen id = _N
    will set id to the total number of values (_N) within each group of FACTORYId that are the same. Since the number of FACTORYId's that are the same can be equal across different FACTORYId's, you don't get a unique identifier.

    try:

    Code:
    encode FACTORYId, gen(id)
    or even:

    Code:
    egen id = group(FACTORYId)
    Last edited by Daniel Schaefer; 30 Oct 2023, 21:45.

    Comment


    • #3
      Hi Daniel,

      Thank you so much! The second one worked.

      Comment


      • #4
        Hi Sarita,

        I bet the first one worked too. The -encode- command keeps the original string as the label for the new numeric value, so you might not be able to see the underlying numeric value with some commands. Look for the nolab option on (e.g.) -tab-. Notice after using -encode- each new observation is blue in the data browser, not red. The blue color signifies that you are seeing a label and not a value. I am entirely sure that, as far as Stata is concerned, the underlying value is a number.
        Last edited by Daniel Schaefer; 31 Oct 2023, 00:13.

        Comment


        • #5
          Hi Daniel,

          The first one i.e. "encode" ran into an error - too many variables.

          Comment


          • #6
            Strictly, the limit on encode is the number of distinct values. The number of variables to encode is always 1.

            Comment


            • #7
              Hello!

              I just want to add to this conversation that the encode option can cause issues if you have a large enough dataset, due to precision. Encode does not know to use a highly precise format (like double), so I think you can actually mess up the data and give different people the same id. The better way to do it is to use egen:

              Code:
              egen double id =group(FACTORYId)
              This does the same thing but is safe in very large datasets.

              Comment


              • #8
                Bruce McDougall is right to warn that encode has limits, but I don't think they are a matter of precision.

                First, encode maps to integers and by default starts at 1 although other choices are possible.

                The limit on encode is the number of distinct value labels allowed, which is at present 65536 (2^16). You would have problems fitting integers all the way up to 65536 into a numeric variable unless you were using float, long or double, but it's I imagine rare that people try to do anything else.

                Any way, encode just refuses to define more than 65536 labels. So, I am not sure what Bruce is referring to, as he seems to be reporting that encode messes up your data, or produces a messy interpretation of it.

                I endorse the advice that egen, group() can be a good way to get numeric identifiers.

                Comment


                • #9
                  Hi Nick

                  Apologies for being a bit vague, partly this is because I can't remember the exact scenario this came up. I may have been using a bysort-type solution or even a solution based on tag. Either way, the important thing is that the problem manifested itself something like the following:
                  _n id number (string, also happened to be encrypted) id
                  65535 ASDLKAGLKJA 65535
                  65536 OITLJDLKLLKS 65536
                  65537 OITLJDLKLLKS 65536
                  65538 UTKAJSFKLKL 65536
                  (The id_number is the string input and the id is the output with the error included.)

                  Notice that there is a point where the system stops working. This was a little scary as I didn't get an error and thought I had a usable identifier. I didn't realist the bottom row was erroneously being given the same id as the second last. Again, from memory I think I solved it by improving the precision.

                  So, the lesson for me was to use the egen (group) solution with double (definitely!) specified. I hope this clarifies the "messing up" I was referring to, even if I am not 100% sure how I got there!

                  Best,
                  Bruce

                  Comment


                  • #10
                    If you don't want to think about what data type is correct for this type of exercise, then use the system macro -c(obs_t)-, which tell Stata to pick the optimal data type to hold the value of _n. Said another way, this the smallest type needed to hold the integers from 1 to _N which will always guarantee you will have a unique integer ID using something like -egen, group()-.

                    For example

                    Code:
                    egen `c(obs_t)' id = group(stringid)

                    Comment


                    • #11
                      #9 It would be good (or at least helpful) to see a reproducible example in which you create a dataset ab initio in which the same numeric identifier is associated with different value labels. We can agree that that should not happen.

                      Comment


                      • #12
                        Originally posted by Nick Cox View Post
                        #9 It would be good (or at least helpful) to see a reproducible example in which you create a dataset ab initio in which the same numeric identifier is associated with different value labels. We can agree that that should not happen.
                        An example would be good to see. I doubt that this happened using -encode-, since it would exit with an error complaining of too many values.

                        Comment


                        • #13
                          Indeed. That is my guess too. as in #8.

                          Comment


                          • #14
                            Seems like encode complains about too many values in Stata 18.

                            Code:
                            clear
                            set obs 65536
                            gen id_string = string(_n)
                            encode id_string, gen(id)
                            Code:
                            . clear
                            
                            . set obs 65536
                            Number of observations (_N) was 0, now 65,536.
                            
                            . gen id_string = string(_n)
                            
                            . encode id_string, gen(id)
                            
                            . list if _n >= 65535, clean noobs
                            
                                id_str~g      id  
                                   65535   65535  
                                   65536   65536  
                            
                            . 
                            end of do-file
                            Code:
                            clear
                            set obs 65537
                            gen id_string = string(_n)
                            encode id_string, gen(id)
                            list if _n >= 65535, clean noobs
                            Code:
                            . clear
                            
                            . set obs 65537
                            Number of observations (_N) was 0, now 65,537.
                            
                            . gen id_string = string(_n)
                            
                            . encode id_string, gen(id)
                            too many values
                            r(134);
                            
                            end of do-file

                            Comment


                            • #15
                              Hello! Here is a reproducible example that shows the problem using

                              egen x=group(y) without specifying the precision:

                              Code:
                              clear all
                              set obs 100000000
                              gen double n=_n
                              
                              tostring n, replace
                              
                              local numbers "1 2 3 4 5 6 7 8 9 0"
                              local letters "a b c d e f g h i j"
                              
                              forvalues i=1/10 {
                              local letter: word `i' of `numbers'
                              local number:word `i' of `letters'
                              replace n=subinstr(n, "`letter'","`number'",.)
                              }
                              
                              egen id=group(n)
                              
                              isid id
                              Gives the error: variable id does not uniquely identify the observations.

                              At the bottom of the dataset we on row 99999998 and row 99999999 we have:
                              row n id
                              99999998 iiiiiiij 9.88e+07
                              99999999 iiiiiii 9.88e+07

                              Comment

                              Working...
                              X