Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Searching one dataset for a variable within another data set

    I have two .dta files. The first contains three variables:

    city - city name
    county - the corresponding county name
    zips - every zip code corresponding to the given city in a string format (for example: "92508 92503 92501 92505 92504 92507 92506 92502 92513 92514 92516 92517 92521 92522")

    The second file simply contains one variable: zip_code (for example: "92505")

    My aim is to create a variable in the second .dta file for county using the first .dta file. I can't think of a way to do so. Any advice?

    **Note: I cannot use dataex for this post because the "data exceeds max linesize".



  • #2
    I made up a couple of toy data sets that seem to conform to your description of the two data sets you are actually working with.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str5 city str7 county str80 zips
    "city1" "county1" "92508 92503 92501 92505 92504 92507 92506 92502 92513 92514 92516 92517 92521 92"
    "city2" "county2" "93969 93053 93997 93968 93190 93971"                                            
    end
    tempfile dataset1
    save `dataset1'
    
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str5 zip
    "92501"
    "92502"
    "92503"
    "92504"
    "92515"
    "93053"
    "93190"
    "93968"
    "94012"
    end
    tempfile dataset2
    save `dataset2'
    
    //  RE-ORGANIZE DATA SET 1 SO IT IS USABLE
    use `dataset1'
    split zips, gen(zip)
    drop zips
    gen long obs_no = _n
    reshape long zip, i(obs_no)
    drop if missing(zip)
    keep city county zip
    duplicates drop
    isid zip, sort
    
    //  NOW MERGE WITH DATA SET 2
    merge 1:m zip using `dataset2', keep(match using) nogenerate
    This code will match up each zip code in data set 2 with the corresponding city and county from data set 1 (when there is a suitable match).

    Now, there is a potential problem. The code relies on each zip in data set 1 only appearing in association with a single city and county. And it verifies this assumption, as the code will break if this assumption is not met. But I think that in real life this is probably untrue. There may well be zip codes that overlap cities or overlap counties. If that is the case, the problem you raise is, itself, ill-posed. If a zip in data set 2 appears associated to more than one city/county in data set 1, how would you decide which city/county to pair it up with in the final result?

    Comment


    • #3
      It's rather worse than Clyde imagines. ZIP Codes are not assigned with reference to political boundaries. The Census uses ZIP Code Tabulation Areas, which are essentially ZIP Codes with some adjustment to overcome the difficulties in precisely defining the land area covered by each ZIP code. Using ZCTA data we get the following bad news.
      Code:
      import delimited https://www2.census.gov/geo/docs/maps-data/data/rel/zcta_county_rel_10.txt
      keep zcta5 state county
      format zcta5  %05.0f
      format county %3.0f
      format state  %2.0f
      duplicates drop
      collapse (count) N=county, by(zcta5 state)
      tab N
      drop N
      duplicates drop
      collapse (count) N=state, by(zcta5)
      tab N
      Code:
          (count) |
           county |      Freq.     Percent        Cum.
      ------------+-----------------------------------
                1 |     24,253       73.00       73.00
                2 |      7,023       21.14       94.14
                3 |      1,696        5.10       99.24
                4 |        234        0.70       99.95
                5 |         15        0.05       99.99
                6 |          2        0.01      100.00
      ------------+-----------------------------------
            Total |     33,223      100.00
      
          (count) |
            state |      Freq.     Percent        Cum.
      ------------+-----------------------------------
                1 |     33,017       99.69       99.69
                2 |        103        0.31      100.00
      ------------+-----------------------------------
            Total |     33,120      100.00
      Added in edit: the reason the Total is 103 less in the state tab is that the county tab includes each of the 103 ZIP codes that are in two states twice - once with the number of counties in the first state, and once with the number of counties in the second state.
      Last edited by William Lisowski; 23 Jan 2022, 10:11.

      Comment

      Working...
      X