Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • converting unique string variables from two datasets that are characters and numbers to numeric variable

    Hello,

    I currently have two datasets with a variable that is called scrambled_id. Scramble_id represents a unique campus. However, both data sets that use scramble_id have a different total number of observations so it is impossible to encode from string (values) to numeric. Scramble_id is a string variable with observations similar to this set up: 132$WWW$P. This type of setup is similar for almost all the observations as they are scrambled to mask identity. I am trying to create a new variable that is converted from scramble_id (which is string) to a numeric variable that would be the same for both datasets so that I could merge both datasets by this variable later on. Any assistance would be appreciated. An example of the variable and observations is below.


    scrambled_id
    132$WWW$P
    136P74WEW
    137353W$W
    134P2WW22
    131470P33
    137PWWP15
    138264$W7
    1360W0$$W
    13757RPRP
    134045P10
    137P65$W3
    132134$L7
    132375P01
    13450WP$3
    135410R06
    1395W2P$7
    136150PR3
    13935$$35
    130P65$E4

  • #2
    If the unique identifier is the same in both datasets, you should be able to use it in a merge without converting it to numeric format, so I am not seeing what the problem you're trying to solve is.
    Last edited by Ali Atia; 18 Jan 2022, 17:41.

    Comment


    • #3
      Sorry if I was not clear. The unique identifiers are the same in both sets as strings. However when I encode to numeric from the string the values do not come out the same for both data sets.

      For example:

      132$WWW$P may come out to be 1 (after being sorted) in dataset one when encoded, but because it may not be the first observation in the second dataset, when encoded it could be 2 or any other number if that makes sense. I am trying to figure out a way to encode the strings, so that regardless of their actual placement after sorting, 132$WWW$P encodes to 1 in the first dataset and in the second dataset 132$WWW$P encodes to 1 although it is not the first observation. I hope I am making sense.

      I cannot merge now because when I encode the string variable in both datasets (unique_id), those observations are converting to different values because of the order they are in after being sorted.

      Comment


      • #4
        First, you can merge by string variables. There is no need to make them numeric.

        Second, if you so wish to turn them into numbers, append the data sets that contains only the id variable (aka if you have 100 and 200 IDs in each file, make a file that has 300 IDs) so that all the possible IDs are in the same variable. Then:

        Code:
        egen unique = tag(id)
        keep if unique == 1
        encode id, gen(num_id)
        label values num_id
        And then you'll have a "crosswalk" file containing the string and the numerical versions. You can then merge them back to each of their file. But as I said, this is completely redundant for what you're trying to do.
        Last edited by Ken Chui; 18 Jan 2022, 21:47.

        Comment

        Working...
        X