  • Removing Non-Numeric Characters from Strings

    I am working with a dataset that contains addresses in Armenian. The house number field is usually numeric, but can also have letters / words before or after the number (the equivalent of 14A or Lower 16). When the data goes into Stata, the Armenian characters become symbols. I am trying to figure out a way to extract just the numbers from the string because I need to search another larger dataset for the nearest "whole number" address. The characters appear in different parts of the field and the numbers are different lengths, so a standard substring won't work. I know there is a way to do this. Any suggestions most welcome.

    Here is an example of what the data looks like:

    var1
    14 ³
    15·
    ¹6
    16 ѳñ³í

  • #2
    Here's an approach that uses regular expressions:

    Code:
    clear
    input str10 s
    "14 ³"
    "15·"
    "¹6"
    "16 ѳñ³í"
    end
    gen n = real(regexs(1)) if regexm(s,"([0-9]+)")
    list


    • #3
      If you are absolutely sure that all of these non-numeric characters are uninformative (at least for your purposes), and they run over a very large range of symbols, then this might be one of those rare situations where -destring- with the -force- option makes sense.

      Code:
      destring var1, gen(numeric_var1) force
      But before doing that I would be very, very cautious that you will not be discarding real information. Are you sure that these extraneous characters never occur within the number? The above code will fail if they do. Also I would run -charlist- (available from SSC) followed by -return list- to be sure the non-numeric characters are only what you think they are.

      Another approach, again if you're sure that the nugget of information you need is a sequence of digits, possibly surrounded by exclusively non-digits, would be

      Code:
      gen numeric_var1 = real(regexs(2)) if regexm(var1, "^([^0-9]*)([0-9]+)([^0-9]*)$")


      • #4
        Robert is much more knowledgeable about regular expressions than I am. Let me just point out a difference between his solution and mine, and let Kristen pick the one that's suitable for her needs.

        If a value of var1 is, for example "#1X2", Robert's code will match that and give 1 as the value of n. My code will read this as a non-match and will give missing as the result.
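
        A minimal sketch of that difference, using made-up test values ("#1X2" and "14A" are illustrative, not from the original data, and the names n_first and n_strict are hypothetical). The stricter pattern is wrapped in real() here so that both results are numeric:

        ```stata
        clear
        input str10 s
        "#1X2"
        "14A"
        end
        * first run of digits anywhere in the string (the pattern from #2)
        gen n_first = real(regexs(1)) if regexm(s, "([0-9]+)")
        * exactly one run of digits, with only non-digits around it (the pattern from #3)
        gen n_strict = real(regexs(2)) if regexm(s, "^([^0-9]*)([0-9]+)([^0-9]*)$")
        list
        ```

        For "#1X2", n_first should be 1 and n_strict should be missing; for "14A", both should be 14.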


        • #5
          Clyde, I don't think that destring works like that; you need to explicitly exclude unwanted characters using the ignore() option.

          Your regex pattern is more conservative than the one I used, which only matches the first series of digits. Probably a wiser choice. I could also recommend moss (from SSC) to match multiple occurrences of digits using

          Code:
          moss s, match("([0-9]+)") regex


          • #6
            Robert, You are right. The -force- option will turn any string that isn't a certifiable number to missing. (I almost never use the -force- option on -destring-, so I forgot how it works.)

            Kristen, sorry--disregard that -destring- code: go with one of the regular expression commands that Robert or I gave you.
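
            To make the -destring- point concrete, here is a small sketch with made-up values (the data and the names n_force and n_ignore are just for illustration). With -force-, any value that is not entirely a number becomes missing; with -ignore()-, the listed characters are stripped before conversion:

            ```stata
            clear
            input str10 s
            "14A"
            "16"
            end
            * force: "14A" is not a valid number, so it becomes missing
            destring s, gen(n_force) force
            * ignore("A"): the character A is removed first, so "14A" becomes 14
            destring s, gen(n_ignore) ignore("A")
            list
            ```

            Since -ignore()- needs the unwanted characters spelled out, it is probably only practical when that set is small and known in advance, which does not seem to be the case with these data.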


            • #7
              See if this program by Michael Blasnik does what you want:

              http://www.stata.com/statalist/archi.../msg00353.html

              For convenience, here it is:

              Code:
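              program define extrnum
                  version 7
                  syntax varlist(max=1) , gen(str)
                  * storage type is e.g. "str10"; characters 4 onward give the maximum length
                  local maxlen: type `varlist'
                  local maxlen=substr("`maxlen'",4,.)
                  tempvar work
                  qui gen str1 `work'=""
                  * append each character that is a single digit; skip everything else
                  forvalues i=1/`maxlen' {
                      qui replace `work'=`work'+substr(`varlist',`i',1) if real(substr(`varlist',`i',1))<.
                  }
                  * convert the collected digits to a number
                  gen `gen'=real(`work')
              end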
              Code:
              . list
              
                   +----------+
                   |     var1 |
                   |----------|
                1. |     var1 |
                2. |     14 ³ |
                3. |      15· |
                4. |       ¹6 |
                5. | 16 ѳñ³í |
                   +----------+
              
              . extrnum var1, gen(nvar1)
              
              . list
              
                   +------------------+
                   |     var1   nvar1 |
                   |------------------|
                1. |     var1       1 |
                2. |     14 ³      14 |
                3. |      15·      15 |
                4. |       ¹6       6 |
                5. | 16 ѳñ³í      16 |
                   +------------------+
              -------------------------------------------
              Richard Williams, Notre Dame Dept of Sociology
              StataNow Version: 18.5 MP (2 processor)

              WWW: https://www3.nd.edu/~rwilliam


              • #8
                Also, if you have egenmore installed (and if so this is probably easier):

                Code:
                . egen nvar2 = sieve(var1), keep(numeric)
                
                . list
                
                     +--------------------------+
                     |     var1   nvar1   nvar2 |
                     |--------------------------|
                  1. |     var1       1       1 |
                  2. |     14 ³      14      14 |
                  3. |      15·      15      15 |
                  4. |       ¹6       6       6 |
                  5. | 16 ѳñ³í      16      16 |
                     +--------------------------+

                • #9
                  Clyde, can you cite any Stata documentation that describes in reasonable detail regular expressions acceptable to Stata? Both help regexm and [D] only say "based on Henry Spencer's NFA algorithm" and "nearly identical to the POSIX.2 standard", and search regexm yields a misleading description from 2005 that is apparently badly outdated. In a few earlier postings I ranted mildly against the inadequate documentation and lack of support for sophisticated regular expressions based on what little documentation I could find with help and search. Your example has shown that the "lack of support" part of the rants was incorrect; perhaps the "documentation" part was also incorrect.

                  Thanks for any advice, and thanks especially for the example you posted above.


                  • #10
                    Thanks everyone. The "real regexs" command worked well so far, but I need to loop this over 821000 variables, so the others may come in handy for any special cases that arise.


                    • #11
                      Originally posted by Kristen Himelein
                      ... I need to loop this over 821000 variables, ....
                      Probably "values". Stata has a limit of 32,767 variables. Sergiy


                      • #12
                        @William. No, actually I don't know of any place with a good summary of regular expressions acceptable to Stata. I don't use regular expressions very often. When I need to refresh my memory I just Google regular expressions POSIX and explore the hits that I get that way. I agree that the Stata documentation should be upgraded to include this material.
