How to convert a long string variable into a numeric variable?

Radi Damianov

Join Date: Feb 2024

Posts: 5
#1

How to convert a long string variable into a numeric variable?

06 Feb 2024, 07:58

HI there! I am currently trying to convert an ID string variable into a numeric variable, as I will need the numeric variable at a later stage for creating a panel ID and making the dataset panel. My issue is that the ID numbers are quite long, thus the values in the string are quite long and when I try and use the destring command, it creates a numerical variable however the values are all in scientific notation. Thus I am not sure how I can proceed, all I want is to basically keep the same long ID value for each observation, but have it in numerical form instead of string. I tried the encode command as well however I have a really big dataset and thus STATA runs out of values to give this encoded variable so that didn't work. Does anyone have any ideas for me? Much appreciated!
Tags: None
Ken Chui

Join Date: Aug 2014

Posts: 1057
#2

06 Feb 2024, 09:01

Based on my experience, I would strongly recommend you keep them as string, especially if they are long chains of numbers. You can manage data and merge data using string variables.

To convert your other numerical one into string, use tostring:

Code:

clear input id 1 2 3 4 5 end tostring id, gen(id_string) browse
2 likes
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35448
#3

06 Feb 2024, 11:08

Quotations here are slightly edited.

I am currently trying to convert an ID string variable into a numeric variable, as I will need the numeric variable at a later stage for creating a panel ID and making the dataset panel. My issue is that the ID numbers are quite long, thus the values in the string are quite long and when I try and use the destring command, it creates a numerical variable however the values are all in scientific notation.

This just means that you should assign a different display format using format, say

Code:

format newid %16.0f

I tried the encode command as well however I have a really big dataset and thus Stata runs out of values to give this encoded variable so that didn't work. Does anyone have any ideas for me?

If you have potentially too many value labels, you're unlikely to need or use them ever, and if you're puzzled by particular observations you can always look at your string identifier to see what it is.

Code:

egen long gid = group(strid)

maps distinct strings to values 1 up and should serve most of the purposes you mention well enough
1 like
Comment
Radi Damianov

Join Date: Feb 2024

Posts: 5
#4

06 Feb 2024, 12:00

Thanks for the help, I also checked and saw that my ID variable has 18 characters, while I read somewhere that the automatic STATA maximum characters for a variable is 16, is there a code that I can run to change it to 18 characters? I think this is why the encode variable isn't working correctly because it is assigning a value based on only 16 of the 18 characters in each person's ID number...
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29962
#5

06 Feb 2024, 12:23

I think you are confusing a few things. The maximum number of characters in a string variable in Stata is not 16, it is 2,000,000,000. (!) What I think you mean to say is that the largest numerical variable storage type, the double, can maintain a precision of 16 decimal digits. If your ID variables are 18 characters, then -destring-ing to numeric will inevitably lose that level of precision, which for an ID variable, makes the variable completely useless.

As for -encode-, the limit on the number of distinct values it can handle (which is not the same as the length of the strings being encoded) is 65,536. You have already tried -encode- and learned that your ID variable takes on too many values for it. Truncating your ID variable, even if it could be done with no loss of information, to 16 characters will not resolve the problem. The problem is that you have more than 65,536 different IDs, so just forget about -encode- for this.

Finally you are left with the possibility of using the -egen, group()- function, recommended in #3. This function does not have the same limit as -encode-, it can accommodate a much larger number of distinct ID values. The number of distinct ID values it can handle is limited by the precision of the data storage type in which it places its results. If you have fewer than 1,000,000,000 distinct ID values, then you should be able to use -egen long newid = group(ID)- to create a numeric variable that starts at 1 and counts up to the number of distinct ID values, in 1-1 correspondence with the ID values. (And you should save a data set that contains that correspondence so that you can always refer back to the original data if needed.) If you have more than 1,000,000,000 distinct ID values, then you need to use -egen double newid = group(ID)-, since a double can accommodate an even larger number of distinct values. And as was also pointed out in #3, don't worry about them appearing in scientific notation: -format newid %16.0f- will take care of that.

Finally, ask yourself why you want to create a numeric ID variable instead of an 18 character string. The main advantage is that you can save some space in both RAM and on disk, and thereby also speed-up your execution. If you have a very large number of observations, this will be appreciable. But in small or moderate size data sets the improvement will not really be perceptible. The other reason for creating a numeric ID is if you need to use it as the panel variable for -xt- commands, since -xtset- will not accept a string variable. If neither of these conditions applies to you, you might be best off following the advice in #2 to just stay with the strings.
1 like
Comment
Radi Damianov

Join Date: Feb 2024

Posts: 5
#6

08 Feb 2024, 12:32

Thanks for all the help. I have one last question Clyde, when I run the egen command and browse the variable, it doesn't go 1, 2, 3, 4 etc. rather it takes irregular values like 2, 5, 9, 16 etc., should I be concerned that something did not work correctly or does it not matter if the values don't increase in increments of
Comment
Radi Damianov

Join Date: Feb 2024

Posts: 5
#7

08 Feb 2024, 12:33

*Increments of 1, apologies it got cut off for some reason!
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35448
#8

08 Feb 2024, 18:24

I suspect that you just need to sort on the new variable and otherwise check that the possible values integers 1 to the number of distinct values are all present.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29962
#9

08 Feb 2024, 18:25

when I run the egen command and browse the variable, it doesn't go 1, 2, 3, 4 etc. rather it takes irregular values like 2, 5, 9, 16 etc.

If that's really what happend, then, yes, something went wrong.

But I'm going to be that it didn't really happen. It just looks that way. When -egen- creates the numeric codes, it assigns the numbers 1, 2, 3, 4... to the values of the string variable in alphabetic order. (Well, actually in ASCII or UNICODE sort order.) But, it leaves the data in whatever sort order they were originally in. So if your data were not in alphabetic order to start with, when you run -egen, group()- and then look at the data, you will see the data in their original order and the new numeric codes will not be in numeric order. In fact, depending on how your data were originally ordered, the string corresponding to 1 could be anywhere, even way down at the end of the data set!

The way to check what's going on is to -summarize- the new numeric variable that -egen, group()- created. You should see that the minimum value is 1 and the maximum value is the number of distinct values of your original string variable.

Added: Crossed with #8.
Comment
Radi Damianov

Join Date: Feb 2024

Posts: 5
#10

09 Feb 2024, 12:15

Thank you very much for the help and explanations, much appreciated!
Comment

Announcement

How to convert a long string variable into a numeric variable?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment