  • Removing numeric characters in a string variable

    Dear all,

    I'm struggling to remove the numeric part of a string variable that contains both numeric and non-numeric characters. An example of my data looks like this:
    Code:
     . list test in 1/3
    
         +--------------------------------------------+
         |                                       test |
         |--------------------------------------------|
      1. |                            555 AA Saarland |
      2. |           515 AA Kaiserslautern – Pirmasens |
      3. |                                  -5 (leer) |
         +--------------------------------------------+
    However, what I would like to have is this:

    Code:
     . list test_new in 1/3
    
         +--------------------------------------------+
         |                                   test_new |
         |--------------------------------------------|
      1. |                                AA Saarland |
      2. |               AA Kaiserslautern – Pirmasens |
      3. |                                     (leer) |
         +--------------------------------------------+
    With
    Code:
    gen test_new = regexs(0) if regexm(test, "[a-zA-Z]+")
    I have come closest so far. But this code leaves me with:

    Code:
     . list test_new in 1/3
    
         +--------------------------------------------+
         |                                   test_new |
         |--------------------------------------------|
      1. |                                         AA |
      2. |                                         AA |
      3. |                                       leer |
         +--------------------------------------------+
    Does anyone have a suggestion on how I could get rid of the numeric part? Or alternatively, how I could extend the regexm condition so that it keeps the parentheses as well as the part after "AA"?

    Best
    Christian

  • #2
    I think you can do this with sieve from -egenmore-.

    Code:
    ssc install egenmore
    egen test_new = sieve(test), keep(alphabetic)
    egen test_new2 = sieve(test), omit(-+0123456789)
    Note that the code might be buggy, as I haven't actually tested it.


    • #3
      Examples are great, but

      1. Please do use dataex (SSC) to give examples as input code. Your example still needs engineering followed by surgery to be imported by anyone else. http://www.statalist.org/forums/help#stata does explain this within the text that every new message prompt asks you to read.

      2. What's general here and what's specific? In particular, getting rid of the first word would work for your example, but perhaps numeric content occurs elsewhere in other observations.

      Others might push harder on the regex door. I am a fan of the regular expression machinery, but I am a fan too of the string functions provided anyway. Here there is a simple algorithm:

      Code:
      * pseudocode
      initialise newstring to empty
      
      foreach word of string {
          if word is numeric do nothing
          else add it to newstring
      }
      Here's an implementation:

      Code:
      clear
      input str40 test
      "555 AA Saarland"
      "515 AA Kaiserslautern Pirmasens"
      "-5 (leer)"
      end
      
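      * Start with an empty result and find the largest word count
      gen newtest = ""
      gen wc = wordcount(test)
      su wc, meanonly
      
      * Append each word that real() cannot convert to a number
      quietly forval j = 1/`r(max)' {
          replace newtest = newtest + word(test, `j') + " " if missing(real(word(test, `j')))
      }
      
      * Remove the trailing space and any repeated internal spaces
      replace newtest = trim(itrim(newtest))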
      
      list
      
      
           +--------------------------------------------------------------------+
           |                            test                       newtest   wc |
           |--------------------------------------------------------------------|
        1. |                 555 AA Saarland                   AA Saarland    3 |
        2. | 515 AA Kaiserslautern Pirmasens   AA Kaiserslautern Pirmasens    4 |
        3. |                       -5 (leer)                        (leer)    2 |
           +--------------------------------------------------------------------+
      Last edited by Nick Cox; 22 Dec 2016, 08:04.


      • #4
        Thanks Nick, that solves it beautifully! Sorry about the data example, next time I'll use dataex.

        Best
        Christian


        • #5
          While regular expressions may not be the best way of addressing this problem, a small change in the provided code produces the desired result for this very small sample of data. Starting from the data as provided by Nick's code, we have
          Code:
          . gen test_new = strtrim(regexs(0)) if regexm(test, "[a-zA-Z() ]+")
          
          . list, clean noobs
          
                                         test                      test_new  
                              555 AA Saarland                   AA Saarland  
              515 AA Kaiserslautern Pirmasens   AA Kaiserslautern Pirmasens  
                                    -5 (leer)                        (leer)
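
          A different regex route, offered only as a sketch, is to delete the numeric part rather than extract the rest. It assumes Stata 14 or later, where ustrregexra() and stritrim() are available, and it assumes that every optionally signed run of digits should go. The variable name test_new2 is just for illustration. Unlike Nick's word-by-word test, this also strips digits embedded in a word, which may or may not be wanted in the full data.

          ```stata
          * Sketch only: assumes Stata 14 or later for ustrregexra() and stritrim()
          clear
          input str40 test
          "555 AA Saarland"
          "515 AA Kaiserslautern Pirmasens"
          "-5 (leer)"
          end

          * Delete every optionally signed run of digits, wherever it occurs
          gen test_new2 = ustrregexra(test, "[-+]?[0-9]+", "")

          * Tidy the blanks left behind by the deletion
          replace test_new2 = strtrim(stritrim(test_new2))

          list, clean noobs
          ```

          Because characters are removed instead of matched, nothing has to be listed in a character class, so punctuation other than a sign directly in front of digits is left alone.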
