Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Cleaning String Variable (address)

    I have a dataset containing mailing addresses, which included all address information in a single cell. I have parsed out city, state, and zipcode into distinct variables, but now I am looking for an efficient way to remove the city name from the address variable.

    Here's an example of what I mean:
    v1 v2 v3 v4 v5
    112 W MAIN DR CHICAGO, TX 71102 112 W MAIN DR CHICAGO CHICAGO TX 71102
    4596 EAST ST. WEST SAN DIEGO, MA 98550 4596 EAST ST. WEST SAN DIEGO SAN DIEGO MA 98550
    v1 was the original variable - with a comma separating only the state and zipcode from the rest of the address
    v2 - v5 are variables I created by parsing on the comma and extracting the state and zipcode.

    Is there a simple way to remove the characters in v2 that exactly match characters in v3 - giving me a clean street address without the city name?

    Thank you!

  • #2
    Code:
    replace v2= subinstr(v2, v3, "", .)

    Comment


    • #3
      The move of San Diego to Massachusetts and Chicago to Texas will require some considerable rewriting of geographical texts. Please stop it now.

      Comment


      • #4
        Thanks Andrew - worked perfectly.

        I also worked out another method using regexr to remove the last word from the address string

        Code:

        gen streetadd=regexr(v2, " [^ ]+$", "")
        replace streetadd=regexr(streetadd, " [^ ]+$", "") if cityspace==1


        The second line removes the last word again (from the new var) if the city name contains a space (cityspace==1) for cities like San Diego.


        *original post did not use original address data - geographical texts safe for now.
        Last edited by Lindsay Bing; 05 Mar 2020, 11:11.

        Comment

        Working...
        X