Searching one dataset for a variable within another data set

Nicholas Stowell

Join Date: Sep 2020

Posts: 16
#1

Searching one dataset for a variable within another data set

22 Jan 2022, 14:49

I have two .dta files. The first contains three variables:

city - city name
county - the corresponding county name
zips - every zip code corresponding to the given city in a string format (for example: "92508 92503 92501 92505 92504 92507 92506 92502 92513 92514 92516 92517 92521 92522")

The second file simply contains one variable: zip_code (for example: "92505")

My aim is to create a variable in the second .dta file for county using the first .dta file. I can't think of a way to do so. Any advice?

**Note: I cannot use dataex for this post because the "data exceeds max linesize".
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

22 Jan 2022, 16:02

I made up a couple of toy data sets that seem to conform to your description of the two data sets you are actually working with.

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str5 city str7 county str80 zips "city1" "county1" "92508 92503 92501 92505 92504 92507 92506 92502 92513 92514 92516 92517 92521 92" "city2" "county2" "93969 93053 93997 93968 93190 93971" end tempfile dataset1 save `dataset1' * Example generated by -dataex-. For more info, type help dataex clear input str5 zip "92501" "92502" "92503" "92504" "92515" "93053" "93190" "93968" "94012" end tempfile dataset2 save `dataset2' // RE-ORGANIZE DATA SET 1 SO IT IS USABLE use `dataset1' split zips, gen(zip) drop zips gen long obs_no = _n reshape long zip, i(obs_no) drop if missing(zip) keep city county zip duplicates drop isid zip, sort // NOW MERGE WITH DATA SET 2 merge 1:m zip using `dataset2', keep(match using) nogenerate

This code will match up each zip code in data set 2 with the corresponding city and county from data set 1 (when there is a suitable match).

Now, there is a potential problem. The code relies on each zip in data set 1 only appearing in association with a single city and county. And it verifies this assumption, as the code will break if this assumption is not met. But I think that in real life this is probably untrue. There may well be zip codes that overlap cities or overlap counties. If that is the case, the problem you raise is, itself, ill-posed. If a zip in data set 2 appears associated to more than one city/county in data set 1, how would you decide which city/county to pair it up with in the final result?
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

23 Jan 2022, 08:57

It's rather worse than Clyde imagines. ZIP Codes are not assigned with reference to political boundaries. The Census uses ZIP Code Tabulation Areas, which are essentially ZIP Codes with some adjustment to overcome the difficulties in precisely defining the land area covered by each ZIP code. Using ZCTA data we get the following bad news.

Code:

import delimited https://www2.census.gov/geo/docs/maps-data/data/rel/zcta_county_rel_10.txt
keep zcta5 state county
format zcta5  %05.0f
format county %3.0f
format state  %2.0f
duplicates drop
collapse (count) N=county, by(zcta5 state)
tab N
drop N
duplicates drop
collapse (count) N=state, by(zcta5)
tab N

Code:

    (count) |
     county |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |     24,253       73.00       73.00
          2 |      7,023       21.14       94.14
          3 |      1,696        5.10       99.24
          4 |        234        0.70       99.95
          5 |         15        0.05       99.99
          6 |          2        0.01      100.00
------------+-----------------------------------
      Total |     33,223      100.00

    (count) |
      state |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |     33,017       99.69       99.69
          2 |        103        0.31      100.00
------------+-----------------------------------
      Total |     33,120      100.00

Added in edit: the reason the Total is 103 less in the state tab is that the county tab includes each of the 103 ZIP codes that are in two states twice - once with the number of counties in the first state, and once with the number of counties in the second state.

Last edited by William Lisowski; 23 Jan 2022, 09:11.

Announcement

Searching one dataset for a variable within another data set

Comment

Comment