converting unique string variables from two datasets that are characters and numbers to numeric variable

John Williams III

Join Date: Jan 2022

Posts: 2
#1

converting unique string variables from two datasets that are characters and numbers to numeric variable

18 Jan 2022, 17:14

Hello,

I currently have two datasets with a variable that is called scrambled_id. Scramble_id represents a unique campus. However, both data sets that use scramble_id have a different total number of observations so it is impossible to encode from string (values) to numeric. Scramble_id is a string variable with observations similar to this set up: 132$WWW$P. This type of setup is similar for almost all the observations as they are scrambled to mask identity. I am trying to create a new variable that is converted from scramble_id (which is string) to a numeric variable that would be the same for both datasets so that I could merge both datasets by this variable later on. Any assistance would be appreciated. An example of the variable and observations is below.

scrambled_id
132$WWW$P
136P74WEW
137353W$W
134P2WW22
131470P33
137PWWP15
138264$W7
1360W0$$W
13757RPRP
134045P10
137P65$W3
132134$L7
132375P01
13450WP$3
135410R06
1395W2P$7
136150PR3
13935$$35
130P65$E4
Tags: None
Ali Atia

Join Date: May 2020

Posts: 737
#2

18 Jan 2022, 17:36

If the unique identifier is the same in both datasets, you should be able to use it in a merge without converting it to numeric format, so I am not seeing what the problem you're trying to solve is.

Last edited by Ali Atia; 18 Jan 2022, 17:41.
3 likes
Comment
John Williams III

Join Date: Jan 2022

Posts: 2
#3

18 Jan 2022, 21:33

Sorry if I was not clear. The unique identifiers are the same in both sets as strings. However when I encode to numeric from the string the values do not come out the same for both data sets.

For example:

132$WWW$P may come out to be 1 (after being sorted) in dataset one when encoded, but because it may not be the first observation in the second dataset, when encoded it could be 2 or any other number if that makes sense. I am trying to figure out a way to encode the strings, so that regardless of their actual placement after sorting, 132$WWW$P encodes to 1 in the first dataset and in the second dataset 132$WWW$P encodes to 1 although it is not the first observation. I hope I am making sense.

I cannot merge now because when I encode the string variable in both datasets (unique_id), those observations are converting to different values because of the order they are in after being sorted.
Comment
Ken Chui

Join Date: Aug 2014

Posts: 1054
#4

18 Jan 2022, 21:44

First, you can merge by string variables. There is no need to make them numeric.

Second, if you so wish to turn them into numbers, append the data sets that contains only the id variable (aka if you have 100 and 200 IDs in each file, make a file that has 300 IDs) so that all the possible IDs are in the same variable. Then:

Code:

egen unique = tag(id) keep if unique == 1 encode id, gen(num_id) label values num_id

And then you'll have a "crosswalk" file containing the string and the numerical versions. You can then merge them back to each of their file. But as I said, this is completely redundant for what you're trying to do.

Last edited by Ken Chui; 18 Jan 2022, 21:47.
1 like
Comment

Announcement

converting unique string variables from two datasets that are characters and numbers to numeric variable

Comment

Comment

Comment