Removing numeric characters in a string variable

Christian Agethen

Join Date: Nov 2014
Posts: 27


22 Dec 2016, 07:19

Dear all,

I'm struggling to remove the numeric part of a string variable that contains both numeric and non-numeric characters. An example of my data looks like this:

Code:

 . list test in 1/3

     +--------------------------------------------+
     |                                       test |
     |--------------------------------------------|
  1. |                            555 AA Saarland |
  2. |           515 AA Kaiserslautern  Pirmasens |
  3. |                                  -5 (leer) |
     +--------------------------------------------+

However, what I would like to have is this:

Code:

 . list test_new in 1/3

     +--------------------------------------------+
     |                                   test_new |
     |--------------------------------------------|
  1. |                                AA Saarland |
  2. |               AA Kaiserslautern  Pirmasens |
  3. |                                     (leer) |
     +--------------------------------------------+

With

Code:

gen test_new = regexs(0) if regexm(test, "[a-zA-Z]+")

I came the closest so far. But this code leaves me with:

Code:

 . list test_new in 1/3

     +--------------------------------------------+
     |                                   test_new |
     |--------------------------------------------|
  1. |                                         AA |
  2. |                                         AA |
  3. |                                       leer |
     +--------------------------------------------+

Does anyone have a suggestion on how I could get rid of the numeric part? Or, alternatively, how could I extend the regexm condition so that it keeps the parentheses as well as the part after "AA"?

Best
Christian


Jesse Wursten

Join Date: Jan 2016

Posts: 915
#2

22 Dec 2016, 07:27

I think you can do this with sieve from -egenmore-.

Code:

ssc install egenmore
egen test_new = sieve(test), keep(alphabetic)
egen test_new2 = sieve(test), omit(-+0123456789)

Note that the code might be buggy, as I haven't actually tested it.

Nick Cox

Join Date: Mar 2014
Posts: 35809

22 Dec 2016, 08:00

Examples are great, but

1. Please do use dataex (SSC) to give examples as input code. Your example still needs engineering followed by surgery to be imported by anyone else. http://www.statalist.org/forums/help#stata does explain this within the text that every new message prompt asks you to read.

2. What's general here and what's specific? In particular, getting rid of the first word would work for your example, but perhaps numeric content occurs elsewhere in other observations.
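For readers unfamiliar with dataex, a minimal sketch of what point 1 asks for, assuming the variable is called test as in the question. The install line is commented out because recent Stata releases already include dataex; treat that as an assumption about your setup.

```stata
* Sketch only: recreate the three example observations, then export them
clear
input str40 test
"555 AA Saarland"
"515 AA Kaiserslautern Pirmasens"
"-5 (leer)"
end

* ssc install dataex    // only needed if dataex is not already installed

* Print the first three observations as input code that others can paste
dataex test in 1/3
```

The output of dataex is a block of input code that can be copied into a post as it stands.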

Others might push harder on the regex door. I am a fan of the regular expression machinery, but I am a fan too of the string functions provided anyway. Here there is a simple algorithm:

Code:

* pseudocode
initialise newstring to empty

foreach word of string {
    if word is numeric do nothing
    else add it to newstring
}

Here's an implementation:

Code:

clear
input str40 test
"555 AA Saarland"
"515 AA Kaiserslautern Pirmasens"
"-5 (leer)"
end

gen newtest = ""
gen wc = wordcount(test)
su wc, meanonly

quietly forval j = 1/`r(max)' {
    replace newtest = newtest + word(test, `j') + " " if missing(real(word(test, `j')))
}

replace newtest = trim(itrim(newtest))

list


     +--------------------------------------------------------------------+
     |                            test                       newtest   wc |
     |--------------------------------------------------------------------|
  1. |                 555 AA Saarland                   AA Saarland    3 |
  2. | 515 AA Kaiserslautern Pirmasens   AA Kaiserslautern Pirmasens    4 |
  3. |                       -5 (leer)                        (leer)    2 |
     +--------------------------------------------------------------------+
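If numeric content can also occur inside or after other words (point 2 above), a character-level variant may be worth a look. The following is a sketch only, assuming Stata 14 or later, where ustrregexra() is available. The fourth observation is hypothetical and was added purely to show digits in other positions.

```stata
* Sketch only: assumes Stata 14+ for ustrregexra()
clear
input str40 test
"555 AA Saarland"
"515 AA Kaiserslautern Pirmasens"
"-5 (leer)"
"AA 12 Saarland 7"
end

* Delete every run of digits, with an optional leading sign, wherever it occurs
gen test_new = ustrregexra(test, "[+\-]?[0-9]+", "")

* Collapse the internal blanks left behind and trim both ends
replace test_new = strtrim(stritrim(test_new))

list, clean noobs
```

Unlike the word-by-word loop, this also strips digits from mixed tokens such as "A12", and it removes a hyphen that directly precedes a digit, so whether it is appropriate depends on what else is in the data.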

Last edited by Nick Cox; 22 Dec 2016, 08:04.


Christian Agethen

Join Date: Nov 2014

Posts: 27
#4

22 Dec 2016, 08:37

Thanks Nick, that solves it beautifully! Sorry about the data example, next time I'll use dataex.

Best
Christian

William Lisowski

Join Date: Dec 2014
Posts: 10150

22 Dec 2016, 08:38

While regular expressions may not be the best way of addressing this problem, a small change in the provided code produces the desired result for this very small sample of data. Starting from the data as provided by Nick's code, we have

Code:

. gen test_new = strtrim(regexs(0)) if regexm(test, "[a-zA-Z() ]+")

. list, clean noobs

                               test                      test_new  
                    555 AA Saarland                   AA Saarland  
    515 AA Kaiserslautern Pirmasens   AA Kaiserslautern Pirmasens  
                          -5 (leer)                        (leer)
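The change works because regexm() returns the leftmost run of characters in the class, and the class now includes the blank and the parentheses. The run would still stop at the first character outside the class, for example an umlaut or a hyphen. A sketch of one way around that, assuming the numeric part is always a prefix: anchor the pattern at the start, skip digits, signs and blanks, and capture whatever follows. The fourth observation is hypothetical and only illustrates the umlaut case.

```stata
* Sketch only: strips a leading numeric prefix, nothing else
clear
input str40 test
"555 AA Saarland"
"515 AA Kaiserslautern Pirmasens"
"-5 (leer)"
"66 AA Saarbrücken"
end

* ^[0-9+ -]* consumes leading digits, signs and blanks;
* (.*)$ captures the rest of the string as subexpression 1
gen test_new2 = regexs(1) if regexm(test, "^[0-9+ -]*(.*)$")

list, clean noobs
```

Numbers that appear later in the string are left untouched by this pattern.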
