Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Replace zeroes with the letter O in address data using regular expressions

    I have a few hundred thousand address records that I need to clean, and part of my cleaning process is replacing all zero characters (0) with the letter O in words. But being addresses, this isn't a case for a simple subinstr command. So I'm trying to figure out how to use regular expressions to do this. I've tried the regex and ustrregex commands, but so far no luck. So I'm hoping someone here can help me.

    For example, take the following two addresses:
    Code:
    101 C0NC0RD AVE
    123 HIGHWAY 30 S0UTH
    In the first example, I want to replace the zeroes in C0NC0RD but not the zero in 101.
    In the second example, I want to replace the zero in S0UTH but not in 30.

    I tried using ustregrexra two different ways, but didn't get the right results:
    Code:
    replace address = ustrregexra(address, "[A-Z]0[A-Z]", "O")
    resulted in
    Code:
    101 OOD AVE
    123 HIGHWAY 30 OTH
    Code:
    replace address = ustrregexra(address, "[A-Z]0[A-Z]", "[A-Z]O[A-Z]")
    resulted in
    Code:
    101 [A-Z]O[A-Z][A-Z]O[A-Z]D AVE
    123 HIGHWAY 30 [A-Z]O[A-Z]TH
    Normally I would use ustrregexs to concatenate together elements of a match, but these are dynamic matches and I can't predict how many times the zeroes will appear.

    I also tried jregex after reading this post (https://www.statalist.org/forums/for...ar-expressions), but I don't have enough experience with Java regular expressions to make this work. I used this command:
    Code:
    jregex replace address, pattern("[A-Z\s]0[A-Z]") rep("O")
    That replaced the correct zeroes with Os but, as with the ustrregexra command, removed the characters on either side.

    Any help would be appreciated.

  • #2
    In the future, please present data examples using dataex. The following should work.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str29 address
    "101 C0NC0RD AVE"    
    "123 HIGHWAY 30 S0UTH"
    "000000000000000"    
    end
    
    gen address2 = ustrregexra(address, "0" , "O") if strpos(address,"0")
    Result:

    Code:
    . l
    
         +---------------------------------------------+
         |              address               address2 |
         |---------------------------------------------|
      1. |      101 C0NC0RD AVE        1O1 CONCORD AVE |
      2. | 123 HIGHWAY 30 S0UTH   123 HIGHWAY 3O SOUTH |
      3. |      000000000000000        OOOOOOOOOOOOOOO |
         +---------------------------------------------+
    ADDED IN EDIT: My apologies, upon re-reading your post, it appears you do not want to replace all instances. I will attempt a solution and get back to you.
    Last edited by Andrew Musau; 16 Oct 2018, 16:46.

    Comment


    • #3
      I think the current problem is trickier than it looks, as Andrew's solution will change even the legitimate zeros to "O". I learned here that ustrregexr() and and regexr() don't accept conventional regex replacement subexpression representations. (To me, the documentation did not exclude this possibility.)

      The best solution I can come up with requires repeated calls to ustrregexra() to replace all of the offending instances with 0s. Perhaps someone else has a nicer approach.
      Code:
       input str30 address
      "101 C0NC0RD AVE"
      "123 HIGHWAY 30 S0UTH"
      end
      local max = 10 // replace up to `max' bogus 0s in an address
      forval i = 1/`max' {
          replace address = ustrregexra(address, ustrregexs(1)+ "0" + ustrregexs(2), ustrregexs(1)+ "O" + ustrregexs(2) )  ///
                 if ustrregexm(address, "([A-Z])0([A-Z])")
      }

      Comment


      • #4
        Thanks Mike Lacy for the input. There is a way, but it is not very intuitive. Search for zeros not surrounded by other numbers

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input str29 address
        "101 C0NC0RD AVE"     
        "123 HIGHWAY 30 S0UTH"
        "000000000000000"     
        end
        
        gen address2= ustrregexra(address, "(?<![0-9])(0)(?![0-9])", "O")
        and now the desired result

        Code:
        . l
        
             +---------------------------------------------+
             |              address               address2 |
             |---------------------------------------------|
          1. |      101 C0NC0RD AVE        101 CONCORD AVE |
          2. | 123 HIGHWAY 30 S0UTH   123 HIGHWAY 30 SOUTH |
          3. |      000000000000000        000000000000000 |
             +---------------------------------------------+

        Comment


        • #5
          Mike's solution in #3 is not serving well for "C0NC0RD", and it would be likely the same for words starting or ending with "0", like "0XF0RD" or "Z00".

          Andrew's in #4 is almost perfect, despite a small notice should be paid for exception cases like "Z00" or "H0LIW00D". One more added line might be an improvement.

          Code:
          gen address3= ustrregexra(address, "(?<![0-9])(0)(?![0-9])", "O")
          
          replace address3= ustrregexra(address3, "(?<![0-9])(00)(?![0-9])", "OO")
          Since this direct solution has been found by Andrew Musau , a tricky detour with -split-, would be serving just for ... fun.

          Code:
          split address, generate(x)
          split address, generate(y) destring force
          
          forval i=1/`r(nvars)' {
          replace x`i' = ustrregexra(x`i', "0" , "O") if y`i' ==.
          }
          egen address4 = concat(x*), p(" ")
          drop x* y*

          Comment


          • #6
            That was perfect. Thanks to both Andrew and Romalpa.

            Comment

            Working...
            X