Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to destring variable when it contains non-numeric characteristics

    Dear Stata Specialist,

    I have a question about destring some non-numeric variables. I know in generate I can do destring variable, replace.

    But if the variable contains non-numeric characteristics, for example, it is coded as AA1101, AA1102, AA1103, etc. What should I do? I knew encode will work, I think after encode and generate a new variable it will has the value from AA1101, AA1102, AA1103 change to 1, 2, 3... etc. But my problem is for this variable it contains more than 500K observations, it doesn't allow me use encode as it says too many observations..

    Any suggestions??

    Thanks
    Yuchen

  • #2
    What you want to do is -egen long wanted = group(the_string_variable)-.

    This will create a numeric variable that counts up from 1 and is in 1:1 correspondence with distinct values of the string variable. It will not, however, apply a label to the values, so these variables will be displayed in listings and outputs as 1, 2, 3, etc. You will not see AA1101, etc. in the new variable. But there will be a 1:1 correspondence between those original values and the new variable's 1, 2, 3, etc.

    Please pay attention to Stata terminology. -destring- is something completely different, and is not applicable to this kind of variable. It is important to understand the difference between -destring- and -encode- (or -egen, group()-).

    -destring- is used when you have a string variable whose content is entirely numeric, its values look like "1", "756.2", "3.1", etc. and you want to change that to a numeric variable whose values are 1, 756.2, 3.1, etc.

    -encode- is used to create a variable that counts up from 1 in 1:1 correspondence with the distinct values of the string variable. It is primarily used to provide a numeric scheme for representing discrete variables like religion, nationality, income categories (but not income itself), etc. Such numeric schemes are needed to use those variables in regressions or other analyses. -encode- is limited to variable with at most 65,536 distinct values. When a variable exceeds that limit, -egen, group()- can be used instead, but the results are not labeled.

    If you mistakenly attempt to -destring- a variable that should be -encode-d, you will, as you have found, get an error message about non-numeric characters. Unfortunately, if you apply -encode- to a variable that should be -destring-ed, Stata will create a variable that is truly dangerous: in displays and listings it looks like a proper numeric variable with the values that were in the original, but it isn't that. It's a 1, 2, 3,... counter that has been disguised by value labels. If you do computations with it you will get garbage.
    Last edited by Clyde Schechter; 24 Jun 2022, 09:10.

    Comment


    • #3
      Here is one possibility:

      Code:
      * here is a test case
      clear
      input str60 strnum
      "AA1101"
      "AA1102"
      "AA1103"
      end
      
      * extract the number from the string
      gen numonlystr = regexs(0) if regexm(strnum, "[0-9]+")
      * convert the extracted string number to a real number
      gen numericonly = real(numonlystr)
      This will give you a numeric representation of the actual number in the string - it will not assign 1, 2, 3 and so on the way Clyde's solution will. Is this the kind of behavior you are looking for?
      Last edited by Daniel Schaefer; 24 Jun 2022, 09:12.

      Comment


      • #4
        If you mistakenly attempt to -destring- a variable that should be -encode-d, you will, as you have found, get an error message about non-numeric characters. Unfortunately, if you apply -encode- to a variable that should be -destring-ed, Stata will create a variable that is truly dangerous: in displays and listings it looks like a proper numeric variable with the values that were in the original, but it isn't that. It's a 1, 2, 3,... counter that has been disguised by value labels. If you do computations with it you will get garbage.
        A few years ago I did a simple OLS regression in Stata and Mplus, only to get two completely different results. It took me hours to discover I had confused -encode- and -destring-. Had I not preformed the same procedure on both platforms I may have never realized my mistake. Truly dangerous indeed...

        Comment


        • #5
          Code:
          egen long wanted = group(the_string_variable), label
          is allowed too. It's essentially equivalent to encode's default.

          Comment


          • #6
            A question that has not been asked: in all your examples, the first two characters are always non-numeric and the remaining characters are always numeric. Is that always the case?

            If, for all your data, there is a sequence of digits in the third through final characters of the string that you want to place in a numeric variable, then
            Code:
            generate wanted = real(substr(the_string_variable,3,.))
            will convert the numeric characters to a numeric value. If the number of digits to be converted is greater than 7, then you will need to use one of the following:
            Code:
            generate long wanted = real(substr(the_string_variable,3,.))
            generate double wanted = real(substr(the_string_variable,3,.))
            For example,
            Code:
            clear
            input str60 the_string_variable
            "AA1101"
            "AA1102"
            "AA1103"
            "AA7654321"
            "AA87654321"
            "AA987654321"
            end
            generate wantedF = real(substr(the_string_variable,3,.))
            generate long wantedL = real(substr(the_string_variable,3,.))
            generate double wantedD = real(substr(the_string_variable,3,.))
            format %12.0f wanted*
            describe
            list, clean noobs
            Code:
            . describe
            
            Contains data
             Observations:             6                  
                Variables:             4                  
            ------------------------------------------------------------------------------------------------
            Variable      Storage   Display    Value
                name         type    format    label      Variable label
            ------------------------------------------------------------------------------------------------
            the_string_va~e str60   %60s                  
            wantedF         float   %12.0f                
            wantedL         long    %12.0f                
            wantedD         double  %12.0f                
            ------------------------------------------------------------------------------------------------
            Sorted by: 
                 Note: Dataset has changed since last saved.
            
            . list, clean noobs
            
                the_strin~e     wantedF     wantedL     wantedD  
                     AA1101        1101        1101        1101  
                     AA1102        1102        1102        1102  
                     AA1103        1103        1103        1103  
                  AA7654321     7654321     7654321     7654321  
                 AA87654321    87654320    87654321    87654321  
                AA987654321   987654336   987654321   987654321

            Comment


            • #7
              Re #5: -egen, group() label- will not work for this problem, because -encode- failed due to exceeding the limit on the number of value labels, and -egen, group() label- is subject to the same limitation.

              Comment


              • #8
                Originally posted by William Lisowski View Post
                A question that has not been asked: in all your examples, the first two characters are always non-numeric and the remaining characters are always numeric. Is that always the case?

                If, for all your data, there is a sequence of digits in the third through final characters of the string that you want to place in a numeric variable, then
                Code:
                generate wanted = real(substr(the_string_variable,3,.))
                will convert the numeric characters to a numeric value. If the number of digits to be converted is greater than 7, then you will need to use one of the following:
                Code:
                generate long wanted = real(substr(the_string_variable,3,.))
                generate double wanted = real(substr(the_string_variable,3,.))
                For example,
                Code:
                clear
                input str60 the_string_variable
                "AA1101"
                "AA1102"
                "AA1103"
                "AA7654321"
                "AA87654321"
                "AA987654321"
                end
                generate wantedF = real(substr(the_string_variable,3,.))
                generate long wantedL = real(substr(the_string_variable,3,.))
                generate double wantedD = real(substr(the_string_variable,3,.))
                format %12.0f wanted*
                describe
                list, clean noobs
                Code:
                . describe
                
                Contains data
                Observations: 6
                Variables: 4
                ------------------------------------------------------------------------------------------------
                Variable Storage Display Value
                name type format label Variable label
                ------------------------------------------------------------------------------------------------
                the_string_va~e str60 %60s
                wantedF float %12.0f
                wantedL long %12.0f
                wantedD double %12.0f
                ------------------------------------------------------------------------------------------------
                Sorted by:
                Note: Dataset has changed since last saved.
                
                . list, clean noobs
                
                the_strin~e wantedF wantedL wantedD
                AA1101 1101 1101 1101
                AA1102 1102 1102 1102
                AA1103 1103 1103 1103
                AA7654321 7654321 7654321 7654321
                AA87654321 87654320 87654321 87654321
                AA987654321 987654336 987654321 987654321
                Hi William,

                Thank you so much. This is exactly what I want. I haven't tried it on my data yet (I am currently using the data at the research center at my university and I can only access the data there due to the data confidentiality issue, and they don't allow us access the internet there). I will test it next Monday then I will let you know. But so far it looks like this is exactly what I want, I was hoping to generate the new variable to get rid of all the non-numerical characteristics and keep all the numerical characteristics.

                Thanks again for everyone's help. You guys are fantastic.

                All the best
                Yuchen

                Comment


                • #9
                  Re #6: I think it is worth repeating William Lisowski’s point very explicitly for any future readers: the floating point type does not represent all of the digits with perfect fidelity when the number of digits is greater than 7. Notice that wantedF is coerced to a floating point number behind the scenes - even though you might think it is an integer. Rounding errors are introduced at the end of numbers with more than 7 digits. Notice that if you need perfect fidelity, then this is a silent error and may not be obvious or easy to detect.

                  Keeping in mind that these are all signed types, a 16 bit integer can only safely represent 4 digits, a 32 bit float can safely represent 7 digits, a 32 bit long integer can safely represent 9 digits, and a 64 bit double can safely represent 15 digits.

                  Code:
                  clear
                  input str60 the_string_variable
                  "AA1101"
                  "AA1102"
                  "AA1103"
                  "AA6543210"
                  "AA76543210"
                  "AA876543210"
                  "AA9876543210"
                  end
                  generate wantedF = real(substr(the_string_variable,3,.))
                  generate long wantedL = real(substr(the_string_variable,3,.))
                  generate double wantedD = real(substr(the_string_variable,3,.))
                  format %12.0f wanted*
                  describe
                  list, clean noobs
                  Code:
                  . describe
                  
                  Contains data
                   Observations:             7                  
                      Variables:             4                  
                  --------------------------------------------------------------------------------
                  Variable      Storage   Display    Value
                      name         type    format    label      Variable label
                  --------------------------------------------------------------------------------
                  the_string_va~e str60   %60s                  
                  wantedF         float   %12.0f                
                  wantedL         long    %12.0f                
                  wantedD         double  %12.0f                
                  --------------------------------------------------------------------------------
                  Sorted by:
                       Note: Dataset has changed since last saved.
                  
                  . list, clean noobs
                  
                      the_string~e      wantedF     wantedL      wantedD  
                            AA1101         1101        1101         1101  
                            AA1102         1102        1102         1102  
                            AA1103         1103        1103         1103  
                         AA6543210      6543210     6543210      6543210  
                        AA76543210     76543208    76543210     76543210  
                       AA876543210    876543232   876543210    876543210  
                      AA9876543210   9876543488           .   9876543210  
                  
                  .
                  end of do-file
                  
                  .

                  Comment


                  • #10
                    Daniel Schaefer nicely makes in post #9 the points about precision that I left implicit in my post #4, where the explicit point I emphasized was that we could take advantage of structure in the data to simplify the conversion of strings to numeric values, bypassing the value labelling issues.

                    To complement post #9, I'll post here a reference table on precision I have posted before for similar discussions.

                    These are the limits on storage of decimal integers with full accuracy in the various numeric storage types. The fixed-point variables lose the 27 largest positive values to missing value codes; the similar loss for floating point variables occurs only for the largest exponent, so it doesn't affect the much smaller integer values.

                    byte - 7 bits -127 100
                    int - 15 bits -32,767 32,740
                    long - 31 bits -2,147,483,647 2,147,483,620
                    float - 24 bits -16,777,216 16,777,216
                    double - 53 bits -9,007,199,254,740,992 9,007,199,254,740,992

                    Comment

                    Working...
                    X