  • Transforming String to Numeric Variables (Encode - missing zeros)

    The values of the original string variable start at zero, but when we use encode to convert the string to a numeric variable, the resulting values start at 1. The code used is encode Variable, gen(NewVariable).
    We tried to use real() instead, but it drops the non-numeric values, which we need to keep for the analysis to be correct.

    How can we transform the string into numerical values without losing the zeros?

  • #2
    Leading zeros should be copied to the value labels.


    Code:
    
    clear 
    input str3 incoming 
    "001"
    "042"
    "666"
    end 
    
    encode incoming, gen(wanted)
    
    list 
    
    list, nolabel
    Code:
     
    
    . list 
    
         +-------------------+
         | incoming   wanted |
         |-------------------|
      1. |      001      001 |
      2. |      042      042 |
      3. |      666      666 |
         +-------------------+
    
    . 
    . list, nolabel 
    
         +-------------------+
         | incoming   wanted |
         |-------------------|
      1. |      001        1 |
      2. |      042        2 |
      3. |      666        3 |
         +-------------------+
    
    .



    • #3
      Originally posted by Sophie Baettig
      We tried to use real() instead, but it drops the non-numeric values, which we need to keep for the analysis to be correct.

      How can we transform the string into numerical values without losing the zeros?
      From your description it sounds like you are trying to preserve the information of some kind of identifier, like a patient identifier that might be randomly assigned. Many analyses require these to be numeric, but the actual value they take is irrelevant as long as there is a 1:1 mapping with the true, original value, so that the structure of the data is maintained. Value labels are also not needed but may ease your mind when browsing or listing data.
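
      The 1:1 mapping mentioned above can be checked directly. The following is only a minimal sketch with made-up identifiers; the variable names incoming and wanted are borrowed from the example in #2, and the data values are assumptions for illustration.

      ```stata
      clear
      input str6 incoming
      "A-001"
      "A-001"
      "B-042"
      "0"
      end

      encode incoming, generate(wanted)

      * each numeric code must correspond to exactly one original string
      bysort wanted (incoming): assert incoming[1] == incoming[_N]

      * and each original string must correspond to exactly one numeric code
      bysort incoming (wanted): assert wanted[1] == wanted[_N]
      ```

      If either assert fails, Stata stops with an error, which signals that the structure of the data was not preserved.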

      encode assigns integers, starting from 1, to each unique value in sorted order, and uses the original strings as the value labels. Nick demonstrates this succinctly. The real() function could be problematic because, if there are nonnumeric characters, it returns a missing value, which is not useful.
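
      If the numeric codes themselves matter (for example, "0" should become 0), one possible approach is to define the value label first and let encode use it through its label() option. This is a sketch under assumptions: the data values and the label name codes are invented for illustration, and strings not already in the label (here "NA") should be added by encode with new integer codes, so it is worth inspecting the result with label list.

      ```stata
      clear
      input str3 incoming
      "0"
      "1"
      "2"
      "NA"
      end

      * real() returns missing for "NA", so that category would be lost
      generate from_real = real(incoming)
      list incoming if missing(from_real) & !missing(incoming)

      * define the numeric codes we want up front, including 0
      label define codes 0 "0" 1 "1" 2 "2"

      * let encode reuse that label; strings not yet in the label
      * are added to it with new integer codes
      encode incoming, generate(wanted) label(codes)

      list, nolabel
      label list codes
      ```

      Adding the noextend option to encode would instead make it refuse to add strings that are missing from the label, which can be useful as a check.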



      • #4
        If the situation is as Leonardo expects, real() could be dangerous. Worse than useless missing values are duplicate valid values for distinct input values:

        Code:
        clear
        input str3 id
        "1"
        "01"
        "001"
        end
        generate id_real = real(id)
        list
        The example above yields

        Code:
        . list
        
             +---------------+
             |  id   id_real |
             |---------------|
          1. |   1         1 |
          2. |  01         1 |
          3. | 001         1 |
             +---------------+
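
        Such collisions can be flagged before relying on real(). A minimal sketch, reusing the example data above; the variable names clash and id_code are hypothetical. Note that strings for which real() returns missing are grouped together as well, so they are flagged too whenever they differ.

        ```stata
        clear
        input str3 id
        "1"
        "01"
        "001"
        end
        generate id_real = real(id)

        * flag numeric values that are shared by more than one distinct string
        bysort id_real (id): generate byte clash = id[1] != id[_N]
        list if clash

        * encode, by contrast, gives each distinct string its own integer code
        encode id, generate(id_code)
        list, nolabel
        ```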



        • #5
          Originally posted by daniel klein
          If the situation is as Leonardo expects, real() could be dangerous.
          This is an excellent point to emphasize, as I wasn't considering this particular example. I was perhaps too used to seeing fixed-format identifiers in which letters are prefixed to a fixed-width numeric identifier, such as "STUDYID-SUBJECTID".
