Generating a unique id from string

Sarita Vaswani

Join Date: Sep 2023
Posts: 14

Generating a unique id from string

30 Oct 2023, 19:41

Hi everyone,

I have a panel data from 2000-2020 at the factory level. The factory ID is a string variable and I am trying to create a numeric id for the factory.

Here is the code that I tried:

Code:

bysort FACTORYId : gen id = _N

And this is how it looks:

Code:

 * Example generated by -dataex-. For more info, type help dataex clear input str50 FACTORYId float id "0000006F" 1 "0000008B" 2 "0000008B" 2 "0000022F" 1 "0000023F" 1 "0000033F" 5 "0000033F" 5 "0000033F" 5 "0000033F" 5 "0000033F" 5 "0000103F" 1 "0000108B" 3 "0000108B" 3 "0000108B" 3 "0000123F" 2 "0000123F" 2 "0000132F" 2 "0000132F" 2 "0000133F" 4 "0000133F" 4 "0000133F" 4 "0000133F" 4 "0000206F" 5 "0000206F" 5 "0000206F" 5 "0000206F" 5 "0000206F" 5 "0000207F" 2 "0000207F" 2 "0000208B" 6 "0000208B" 6 "0000208B" 6 "0000208B" 6 "0000208B" 6 "0000208B" 6 "0000209F" 2 "0000209F" 2 "0000219F" 2 "0000219F" 2 "0000222F" 1 "0000306F" 1 "0000308B" 3 "0000308B" 3 "0000308B" 3 "0000310F" 3 "0000310F" 3 "0000310F" 3 "0000322F" 1 "0000323F" 1 "0000332F" 1 "0000333F" 3 "0000333F" 3 "0000333F" 3 "0000406F" 1 "0000407F" 2 "0000407F" 2 "0000408B" 5 "0000408B" 5 "0000408B" 5 "0000408B" 5 "0000408B" 5 "0000420F" 2 "0000420F" 2 "0000522F" 3 "0000522F" 3 "0000522F" 3 "0000523F" 1 "0000532F" 1 "0000607F" 1 "0000622F" 1 "0000623F" 4 "0000623F" 4 "0000623F" 4 "0000623F" 4 "0000706F" 1 "0000708B" 8 "0000708B" 8 "0000708B" 8 "0000708B" 8 "0000708B" 8 "0000708B" 8 "0000708B" 8 "0000708B" 8 "0000723F" 1 "0000733F" 2 "0000733F" 2 "0000803F" 1 "0000806F" 4 "0000806F" 4 "0000806F" 4 "0000806F" 4 "0000807F" 1 "0000820F" 2 "0000820F" 2 "0000823F" 5 "0000823F" 5 "0000823F" 5 "0000823F" 5 "0000823F" 5 "0000832F" 1 end

For some reason, this is not generating unique ids. I thought maybe the issue is it's generating a float variable but even after I try gen long, the problem remains.

Tags: stringvariables

Daniel Schaefer

Join Date: Mar 2020

Posts: 810
#2

30 Oct 2023, 20:26

this code:

Code:

bysort FACTORYId : gen id = _N

will set id to the total number of values (_N) within each group of FACTORYId that are the same. Since the number of FACTORYId's that are the same can be equal across different FACTORYId's, you don't get a unique identifier.

try:

Code:

encode FACTORYId, gen(id)

or even:

Code:

egen id = group(FACTORYId)

Last edited by Daniel Schaefer; 30 Oct 2023, 20:45.
1 like
Comment
Sarita Vaswani

Join Date: Sep 2023

Posts: 14
#3

30 Oct 2023, 22:48

Hi Daniel,

Thank you so much! The second one worked.
Comment
Daniel Schaefer

Join Date: Mar 2020

Posts: 810
#4

30 Oct 2023, 23:09

Hi Sarita,

I bet the first one worked too. The -encode- command keeps the original string as the label for the new numeric value, so you might not be able to see the underlying numeric value with some commands. Look for the nolab option on (e.g.) -tab-. Notice after using -encode- each new observation is blue in the data browser, not red. The blue color signifies that you are seeing a label and not a value. I am entirely sure that, as far as Stata is concerned, the underlying value is a number.

Last edited by Daniel Schaefer; 30 Oct 2023, 23:13.
1 like
Comment
Sarita Vaswani

Join Date: Sep 2023

Posts: 14
#5

06 Nov 2023, 09:57

Hi Daniel,

The first one i.e. "encode" ran into an error - too many variables.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35432
#6

06 Nov 2023, 10:12

Strictly, the limit on encode is the number of distinct values. The number of variables to encode is always 1.
Comment
Bruce McDougall

Join Date: May 2020

Posts: 45
#7

20 Feb 2025, 05:22

Hello!

I just want to add to this conversation that the encode option can cause issues if you have a large enough dataset, due to precision. Encode does not know to use a highly precise format (like double), so I think you can actually mess up the data and give different people the same id. The better way to do it is to use egen:

Code:

egen double id =group(FACTORYId)

This does the same thing but is safe in very large datasets.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35432
#8

20 Feb 2025, 07:14

Bruce McDougall is right to warn that encode has limits, but I don't think they are a matter of precision.

First, encode maps to integers and by default starts at 1 although other choices are possible.

The limit on encode is the number of distinct value labels allowed, which is at present 65536 (2^16). You would have problems fitting integers all the way up to 65536 into a numeric variable unless you were using float, long or double, but it's I imagine rare that people try to do anything else.

Any way, encode just refuses to define more than 65536 labels. So, I am not sure what Bruce is referring to, as he seems to be reporting that encode messes up your data, or produces a messy interpretation of it.

I endorse the advice that egen, group() can be a good way to get numeric identifiers.
Comment
Bruce McDougall

Join Date: May 2020

Posts: 45
#9

20 Feb 2025, 07:53

Hi Nick

Apologies for being a bit vague, partly this is because I can't remember the exact scenario this came up. I may have been using a bysort-type solution or even a solution based on tag. Either way, the important thing is that the problem manifested itself something like the following:

_n id number (string, also happened to be encrypted) id

65535 ASDLKAGLKJA 65535

65536 OITLJDLKLLKS 65536

65537 OITLJDLKLLKS 65536

65538 UTKAJSFKLKL 65536

(The id_number is the string input and the id is the output with the error included.)

Notice that there is a point where the system stops working. This was a little scary as I didn't get an error and thought I had a usable identifier. I didn't realist the bottom row was erroneously being given the same id as the second last. Again, from memory I think I solved it by improving the precision.

So, the lesson for me was to use the egen (group) solution with double (definitely!) specified. I hope this clarifies the "messing up" I was referring to, even if I am not 100% sure how I got there!

Best,
Bruce
Comment
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2389
#10

20 Feb 2025, 08:11

If you don't want to think about what data type is correct for this type of exercise, then use the system macro -c(obs_t)-, which tell Stata to pick the optimal data type to hold the value of _n. Said another way, this the smallest type needed to hold the integers from 1 to _N which will always guarantee you will have a unique integer ID using something like -egen, group()-.

For example

Code:

egen `c(obs_t)' id = group(stringid)
2 likes
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35432
#11

20 Feb 2025, 08:21

#9 It would be good (or at least helpful) to see a reproducible example in which you create a dataset ab initio in which the same numeric identifier is associated with different value labels. We can agree that that should not happen.
Comment
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2389
#12

20 Feb 2025, 08:46

Originally posted by Nick Cox View Post

#9 It would be good (or at least helpful) to see a reproducible example in which you create a dataset ab initio in which the same numeric identifier is associated with different value labels. We can agree that that should not happen.

An example would be good to see. I doubt that this happened using -encode-, since it would exit with an error complaining of too many values.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35432
#13

20 Feb 2025, 08:50

Indeed. That is my guess too. as in #8.
Comment

Daniel Schaefer

Join Date: Mar 2020
Posts: 810

#14

20 Feb 2025, 09:33

Seems like encode complains about too many values in Stata 18.

Code:

clear
set obs 65536
gen id_string = string(_n)
encode id_string, gen(id)

Code:

. clear

. set obs 65536
Number of observations (_N) was 0, now 65,536.

. gen id_string = string(_n)

. encode id_string, gen(id)

. list if _n >= 65535, clean noobs

    id_str~g      id  
       65535   65535  
       65536   65536  

. 
end of do-file

Code:

clear
set obs 65537
gen id_string = string(_n)
encode id_string, gen(id)
list if _n >= 65535, clean noobs

Code:

. clear

. set obs 65537
Number of observations (_N) was 0, now 65,537.

. gen id_string = string(_n)

. encode id_string, gen(id)
too many values
r(134);

end of do-file

Comment

Bruce McDougall

Join Date: May 2020

Posts: 45
#15

25 Feb 2025, 06:47

Hello! Here is a reproducible example that shows the problem using

egen x=group(y) without specifying the precision:

Code:

clear all set obs 100000000 gen double n=_n tostring n, replace local numbers "1 2 3 4 5 6 7 8 9 0" local letters "a b c d e f g h i j" forvalues i=1/10 { local letter: word `i' of `numbers' local number:word `i' of `letters' replace n=subinstr(n, "`letter'","`number'",.) } egen id=group(n) isid id

Gives the error: variable id does not uniquely identify the observations.

At the bottom of the dataset we on row 99999998 and row 99999999 we have:
row n id
99999998 iiiiiiij 9.88e+07
99999999 iiiiiii 9.88e+07
Comment

_n	id number (string, also happened to be encrypted)	id
65535	ASDLKAGLKJA	65535
65536	OITLJDLKLLKS	65536
65537	OITLJDLKLLKS	65536
65538	UTKAJSFKLKL	65536

Announcement

Generating a unique id from string

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment