Very large Integers of variable length

Adrian Sayers

Join Date: Apr 2014

Posts: 67
#1

Very large Integers of variable length

15 Apr 2024, 12:01

Hi all,

I was wondering if anyone has found a good soloution to working with very large integers as ID variables of variable length.

Background,
Working with a national medical database and they have a key (EPIKEY) which is 19 Numeric characters long.

It is stored in a SQL database as numeric((20,0),null)

This variable should be unique, when you read it directly using odbc load it reads as "double".

Unfortunately, this makes the variable non-unique, i am assuming precision issues.

So i recast this to "BIGINT" and alias it and then it reads as string.

Code:

odbc load, exec("SELECT * , CAST(EPIKEY AS BIGINT) AS EPIKEY_STR FROM .....

This seems like an ugly and inefficient workaround.

20bytes is quite an expensive storage.

An equally ugly workaround would be

Code:

gen long epikey_a = real(substr(epikey,1,9)) gen long epikey_b = real(substr(epikey,10,9)) gen byte epikey_c = real(substr(epikey,19,1))

But this 10 bytes would be a lot more efficent, but who wants to work with 3 variables when you can have 1.

However, splitting the keys like this unfortunately doesn't work as the ID's are not always the same length.
and if the split occurs on "001" this is converted to "1" when stored as a long.

Stata have this helpful comment on there help file

If you are storing identification numbers, the rounding could matter. If the identification numbers are integers and take 9 digits or less, store them as
longs; otherwise, store them as doubles. doubles have 16 digits of accuracy.

But now I think we need some help on holding more than 16 digits of accuracy.

It seems like i am stuck with a string for the moment.

bw
Adrian
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#2

15 Apr 2024, 12:26

I have found myself in this position from time to time. If the count of distinct values of the identifiers is small enough to fit into a long, I just create a new identifier:

Code:

egen long new_id = group(original_id)

Actually, I do this by creating a file that crosswalks the original_id with the new_id. I keep that safe in case it is needed. Then I merge the new_id into the working data set(s) and eliminate the original_id variable.

If the number of distinct values is so large that even 16 digits accuracy cannot work, then I think you are stuck with the workarounds you are currently doing.

While I'm inclined to agree that some kind of big integer storage type would probably be useful, I have been impressed at the lengths that some institutions will go to to mask their identifier variable. I have seen several data sets from hospitals where the pseudoidentifier is a string of 64 characters (hexadecimal digits) and even seen one that used a string of 128 characters. Clearly the number of distinct patients cannot remotely require that many bytes. But it also suggests that no big integer storage type is going to be adequate for this task as there will always be system administrators out there who will break it in the name of confidentiality.
1 like
Comment
Adrian Sayers

Join Date: Apr 2014

Posts: 67
#3

15 Apr 2024, 12:31

Thanks Clyde,

Sometimes it makes me wonder what they are trying to do with these super long ID.

I had thought i was missing something, but perhaps sadly i am not.
A
Comment
Maarten Buis

Join Date: Mar 2014

Posts: 3407
#4

15 Apr 2024, 15:12

Long I'd numbers can come about if they are generated in steps: first digit encodes the continent, second and third the country, fourth and fifth the site, sixth and seventh the village, eighth, nineth and tenth the household, eleventh and twelveth the individual. Now you have an id that won't fit in a float. This is of course the very opposite of annonymising the data.

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Comment

Announcement

Very large Integers of variable length

Comment

Comment

Comment