Cleaning String Variable (address)

Lindsay Bing

Join Date: Mar 2020

Posts: 2
#1

Cleaning String Variable (address)

05 Mar 2020, 10:09

I have a dataset containing mailing addresses, which included all address information in a single cell. I have parsed out city, state, and zipcode into distinct variables, but now I am looking for an efficient way to remove the city name from the address variable.

Here's an example of what I mean:

v1 v2 v3 v4 v5

112 W MAIN DR CHICAGO, TX 71102 112 W MAIN DR CHICAGO CHICAGO TX 71102

4596 EAST ST. WEST SAN DIEGO, MA 98550 4596 EAST ST. WEST SAN DIEGO SAN DIEGO MA 98550

v1 was the original variable - with a comma separating only the state and zipcode from the rest of the address
v2 - v5 are variables I created by parsing on the comma and extracting the state and zipcode.

Is there a simple way to remove the characters in v2 that exactly match characters in v3 - giving me a clean street address without the city name?

Thank you!
Tags: None
Andrew Musau

Join Date: Oct 2014

Posts: 10194
#2

05 Mar 2020, 10:36

Code:

replace v2= subinstr(v2, v3, "", .)
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35698
#3

05 Mar 2020, 11:02

The move of San Diego to Massachusetts and Chicago to Texas will require some considerable rewriting of geographical texts. Please stop it now.
Comment
Lindsay Bing

Join Date: Mar 2020

Posts: 2
#4

05 Mar 2020, 11:06

Thanks Andrew - worked perfectly.

I also worked out another method using regexr to remove the last word from the address string

Code:

gen streetadd=regexr(v2, " [^ ]+$", "")
replace streetadd=regexr(streetadd, " [^ ]+$", "") if cityspace==1

The second line removes the last word again (from the new var) if the city name contains a space (cityspace==1) for cities like San Diego.

*original post did not use original address data - geographical texts safe for now.

Last edited by Lindsay Bing; 05 Mar 2020, 11:11.
Comment

v1	v2	v3	v4	v5
112 W MAIN DR CHICAGO, TX 71102	112 W MAIN DR CHICAGO	CHICAGO	TX	71102
4596 EAST ST. WEST SAN DIEGO, MA 98550	4596 EAST ST. WEST SAN DIEGO	SAN DIEGO	MA	98550

Announcement

Cleaning String Variable (address)

Comment

Comment

Comment