splitting Chinese addresses?

River Huang

Join Date: Mar 2016

Posts: 1899
#16

25 Feb 2022, 19:52

Dear Andrew, Suppose that I save the county and county+town information in two separate Stata data files, county.dta and county-town.dta, as attached. I have the same addresses to split.

Code:

clear
input str144 t_addr2
"台北市市民大道三段2號5樓"
"台北市南港區園區街3之1號G棟8樓"
"508 彰化縣和美鎮西鄉路161巷2號"
"302 新竹縣竹北市文興路一段372號"
"320 桃園市中壢區桃園市中壢區中山路201號4樓"
end

Do I need to use frames to do this, and if so, how? Thanks.
Attached Files

county.dta (984 Bytes, 1 view)

county-town.dta (8.3 KB, 1 view)

Ho-Chuan (River) Huang
Stata 17.0, MP(4)

Andrew Musau

Join Date: Oct 2014
Posts: 9947

#17

26 Feb 2022, 07:13

For the 239th observation, it should be 臺南市新市區. Thus, I guess we should use your prior suggestion
Code:
gen county = ustrregexs(0) if ustrregexm(" " + t_addr2 + " ", "(`counties')")
to extract cities/counties (instead of gen county = ustrregexra(countytown,"(.*縣|.*市)?(.*)?", "$1")). Any comments?
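
The difference between the two approaches can be reproduced on a small example. The sketch below is only an illustration: the two-name county list is a hypothetical stand-in for county.dta, and the variable names county_greedy and county_listed are made up for the comparison. Because `.*市` is greedy, it can extend to the last 市 in 臺南市新市區, which is presumably what misclassified observation 239, whereas matching against the known list can only return an actual county name.

```stata
* Minimal sketch: greedy regex vs. matching against a known list.
* The two-county list below is a hypothetical stand-in for county.dta.
clear
input str30 countytown
"臺南市新市區"
"彰化縣和美鎮"
end

* Greedy: ".*市" may run to the LAST 市, so the split can land too late.
gen county_greedy = ustrregexra(countytown, "(.*縣|.*市)?(.*)?", "$1")

* List-based: only strings that are actual county names can match.
local counties "臺南市|彰化縣"
gen county_listed = ustrregexs(0) if ustrregexm(countytown, "(`counties')")

list, noobs
```

Comparing the two generated variables for the first row shows whether the greedy pattern stopped at the county boundary.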

I agree with your suggested solution. As we have the list of counties, there is no need to extract them using regular expressions and we avoid misclassifications as in observation 239. With the counties and towns in the datasets, there is no need for frames. Here is the full code using your datasets in #16:

Code:

* Build a "|"-separated list of county names to use as a regex alternation
use "county.dta", clear
levelsof county, local(counties) sep(|) clean
clear
input str144 t_addr2
"台北市市民大道三段2號5樓"
"台北市南港區園區街3之1號G棟8樓"
"508 彰化縣 和美鎮 西鄉路161巷2號"
"302 新竹縣竹北市文興路一段372號"
"320 桃園市 中壢區 桃園市中壢區中山路201號4樓"
end
* County: the leftmost match among the listed county names
gen county = ustrregexs(0) if ustrregexm(" " + t_addr2 + " ", "(`counties')")
* Build the town list from county-town.dta without losing the address data
preserve
use "county-town.dta", clear
gen county = ustrregexs(0) if ustrregexm(" " + countytown+ " ", "(`counties')")
* Town: what is left of countytown once the county is removed
gen town = ustrregexra(countytown,county, "")
list in 239
levelsof town, local(towns) sep(|) clean
restore
gen town = ustrregexs(0) if ustrregexm(t_addr2, "(`towns')")
* Road: drop county and town, then keep the text up to 街, 路, or 大道
gen road= ustrregexra(ustrregexra(ustrregexra(t_addr2, county, "", 1), town, "", 1), "^([0-9]*)(.*街|.*路|.*大道)?(.*)?", "$2")

Res.:

Code:

. list in 239

     +--------------------------------+
     |   countytown   county     town |
     |--------------------------------|
239. | 臺南市新市區   臺南市   新市區 |
     +--------------------------------+

. l

     +----------------------------------------------------------------------------+
     |                                      t_addr2   county     town        road |
     |----------------------------------------------------------------------------|
  1. |                     台北市市民大道三段2號5樓   台北市             市民大道 |
  2. |               台北市南港區園區街3之1號G棟8樓   台北市   南港區      園區街 |
  3. |             508 彰化縣 和美鎮 西鄉路161巷2號   彰化縣   和美鎮      西鄉路 |
  4. |              302 新竹縣竹北市文興路一段372號   新竹縣   竹北市      文興路 |
  5. | 320 桃園市 中壢區 桃園市中壢區中山路201號4樓   桃園市   中壢區      中山路 |
     +----------------------------------------------------------------------------+
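
On the frames question in #16: frames are not required, but if you prefer them to preserve/restore, something along the following lines should work in Stata 16 or later. This is a minimal sketch under assumptions, not a tested replacement for the code above: the frame name lookup is arbitrary, and the three counties typed in with input are a hypothetical stand-in for county.dta so that the example runs on its own.

```stata
* Minimal sketch (Stata 16+): keep the lookup list in its own frame.
* The frame name "lookup" and the three counties are hypothetical;
* in practice the frame would hold county.dta.
frames reset
frame create lookup
frame change lookup
input str12 county
"台北市"
"彰化縣"
"新竹縣"
end
* Build the "|"-separated alternation, as levelsof does in the full code.
levelsof county, local(counties) sep(|) clean
frame change default

* The addresses live in the default frame; the local macro is still in scope.
input str144 t_addr2
"台北市市民大道三段2號5樓"
"508 彰化縣和美鎮西鄉路161巷2號"
"302 新竹縣竹北市文興路一段372號"
end
gen county = ustrregexs(0) if ustrregexm(t_addr2, "(`counties')")
list, noobs
```

The town list could be built the same way in a second frame. The only gain over preserve/restore is that the lookup data stay in memory alongside the addresses.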


River Huang

Join Date: Mar 2016

Posts: 1899
#18

26 Feb 2022, 17:30

Dear Andrew, Thanks again for your very useful suggestions.

Ho-Chuan (River) Huang
Stata 17.0, MP(4)