Removing Non-Numeric Characters from Strings

Kristen Himelein

Join Date: Apr 2014

Posts: 10
#1


05 Mar 2015, 10:29

I am working with a dataset that contains addresses in Armenian. The house number field is usually numeric, but can also have letters / words before or after the number (the equivalent of 14A or Lower 16). When the data goes into Stata, the Armenian characters become symbols. I am trying to figure out a way to extract just the numbers from the string because I need to search another larger dataset for the nearest "whole number" address. The characters appear in different parts of the field and the numbers are different lengths, so a standard substring won't work. I know there is a way to do this. Any suggestions most welcome.

Here is an example of what the data looks like:

var1
14 ³
15·
¹6
16 Ñ³ñ³í
Robert Picard

Join Date: Mar 2014

Posts: 1536
#2

05 Mar 2015, 10:48

Here's an approach that uses regular expressions:

Code:

clear
input str10 s
"14 ³"
"15·"
"¹6"
"16 Ñ³ñ³í"
end
gen n = real(regexs(1)) if regexm(s,"([0-9]+)")
list
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#3

05 Mar 2015, 10:52

If you are absolutely sure that all of these non-numeric characters are uninformative (at least for your purposes), and they run over a very large range of symbols, then this might be one of those rare situations where -destring- with the -force- option makes sense.

Code:

destring var1, gen(numeric_var1) force

But before doing that I would be very, very cautious that you will not be discarding real information. Are you sure that these extraneous characters never occur within the number? The above code will fail if they do. Also I would run -charlist- (available from SSC) followed by -return list- to be sure the non-numeric characters are only what you think they are.

Another approach, again if you're sure that the nugget of information you need is a sequence of digits, possibly surrounded by exclusively non-digits, would be

Code:

gen numeric_var1 = real(regexs(2)) if regexm(var1, "^([^0-9]*)([0-9]+)([^0-9]*)$")
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#4

05 Mar 2015, 10:56

Robert is much more knowledgeable about regular expressions than I am. Let me just point out a difference between his solution and mine, and let Kristen pick the one that's suitable for her needs.

If a value of var1 is, for example, "#1X2", Robert's code will match that and give 1 as the value of n. My code will read this as a non-match and will give missing as the result.
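
The difference can be checked on a couple of made-up values. This is only a minimal sketch: the variable names s, n_first and n_strict are illustrative, and the anchored pattern is wrapped in real() so that both results are numeric.

```stata
clear
input str10 s
"#1X2"
"16 abc"
end

* Unanchored pattern: takes the first run of digits found anywhere in s
gen n_first = real(regexs(1)) if regexm(s, "([0-9]+)")

* Anchored pattern: matches only if s contains exactly one run of digits
gen n_strict = real(regexs(2)) if regexm(s, "^([^0-9]*)([0-9]+)([^0-9]*)$")

list
```

For "#1X2" the first variable is 1 and the second is missing; for "16 abc" both are 16.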
Robert Picard

Join Date: Mar 2014

Posts: 1536
#5

05 Mar 2015, 11:12

Clyde, I don't think that destring works like that; you need to explicitly exclude unwanted characters using the ignore() option.

Your regex pattern is more conservative than the one I used, which only matches the first series of digits. Probably a wiser choice. I could also recommend moss (from SSC) to match multiple occurrences of digits using:

Code:

moss s, match("([0-9]+)") regex
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#6

05 Mar 2015, 11:22

Robert, You are right. The -force- option will turn any string that isn't a certifiable number to missing. (I almost never use the -force- option on -destring-, so I forgot how it works.)

Kristen, sorry--disregard that -destring- code: go with one of the regular expression commands that Robert or I gave you.
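
For reference, the behavior of -force- versus -ignore()- can be seen on a small made-up example. This is a sketch only: the values and the variable names s, n_force and n_ignore are illustrative, and ignore() needs the unwanted characters listed explicitly, which is impractical when they range over many symbols.

```stata
clear
input str10 s
"14"
"14A"
end

* force: any string that is not a clean number becomes missing
destring s, gen(n_force) force

* ignore(): the listed characters are removed before conversion
destring s, gen(n_ignore) ignore("A")

list
```

Here n_force is missing for "14A", while n_ignore is 14 for both observations.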

Richard Williams

Join Date: Apr 2014
Posts: 4945

05 Mar 2015, 11:26

See if this program by Michael Blasnik does what you want:

http://www.stata.com/statalist/archi.../msg00353.html

For convenience, here it is:

Code:

program define extrnum
version 7
syntax varlist(max=1) , gen(str)
local maxlen: type `varlist'
local maxlen=substr("`maxlen'",4,.)
tempvar work
qui gen str1 `work'=""
forvalues i=1/`maxlen' {
 qui replace `work'=`work'+substr(`varlist',`i',1) if real(substr(`varlist',`i',1))<.
}
gen `gen'=real(`work')
end

Code:

. list

     +----------+
     |     var1 |
     |----------|
  1. |     var1 |
  2. |     14 ³ |
  3. |      15· |
  4. |       ¹6 |
  5. | 16 Ñ³ñ³í |
     +----------+

. extrnum var1, gen(nvar1)

. list

     +------------------+
     |     var1   nvar1 |
     |------------------|
  1. |     var1       1 |
  2. |     14 ³      14 |
  3. |      15·      15 |
  4. |       ¹6       6 |
  5. | 16 Ñ³ñ³í      16 |
     +------------------+

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://www3.nd.edu/~rwilliam


Richard Williams

Join Date: Apr 2014
Posts: 4945

05 Mar 2015, 11:30

Also, if you have egenmore installed (and if so this is probably easier):

Code:

. egen nvar2 = sieve(var1), keep(numeric)

. list

     +--------------------------+
     |     var1   nvar1   nvar2 |
     |--------------------------|
  1. |     var1       1       1 |
  2. |     14 ³      14      14 |
  3. |      15·      15      15 |
  4. |       ¹6       6       6 |
  5. | 16 Ñ³ñ³í      16      16 |
     +--------------------------+


William Lisowski

Join Date: Dec 2014

Posts: 10150
#9

05 Mar 2015, 20:40

Clyde, can you cite any Stata documentation that describes in reasonable detail regular expressions acceptable to Stata? Both help regexm and [D] only say "based on Henry Spencer's NFA algorithm" and "nearly identical to the POSIX.2 standard", and search regexm yields a misleading description from 2005 that is apparently badly outdated. In a few earlier postings I ranted mildly against the inadequate documentation and lack of support for sophisticated regular expressions based on what little documentation I could find with help and search. Your example has shown that the "lack of support" part of the rants was incorrect; perhaps the "documentation" part was also incorrect.

Thanks for any advice, and thanks especially for the example you posted above.
Kristen Himelein

Join Date: Apr 2014

Posts: 10
#10

05 Mar 2015, 22:12

Thanks everyone. The "real regexs" command worked well so far, but I need to loop this over 821000 variables, so the others may come in handy for any special cases that arise.
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#11

06 Mar 2015, 00:56

Originally posted by Kristen Himelein

... I need to loop this over 821000 variables, ....

Probably "values". Stata has a limit of 32,767 variables. Sergiy
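
If these are indeed 821,000 values (observations) of a single variable, no explicit loop should be needed, because generate evaluates its expression for every observation. A minimal sketch on made-up data (the variable names s and n and the generated strings are purely illustrative):

```stata
clear
set obs 821000

* Made-up strings of the form "<number> x", one per observation
gen str10 s = string(_n) + " x"

* A single generate statement handles all observations at once
gen n = real(regexs(1)) if regexm(s, "([0-9]+)")

count if missing(n)
```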
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#12

06 Mar 2015, 08:06

@William. No, actually I don't know of any place with a good summary of regular expressions acceptable to Stata. I don't use regular expressions very often. When I need to refresh my memory I just Google regular expressions POSIX and explore the hits that I get that way. I agree that the Stata documentation should be upgraded to include this material.
