Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging two datasets with same county names (missing counties in one set), different IDs

    Hi all, this is my first post here and also new to Stata so apologies if not clear!

    I'm trying to merge the two datasets below - one lists just two variables (county name "mun" and county ID "cvegeo"), and the other is agricultural production data by county.

    I want to assign the county ID variable "cvegeo" in the first dataset to the county name in the agricultural data set using the common county name variable, "mun".

    At first, I tried encoding "mun" in both datasets and merging on that new numeric ID, but I am missing 21 counties from the agriculture dataset that are included in the cvegeo one (2305 vs. 2326), so the numeric values assigned to "mun" are different between sets.

    Is it possible for me to merge based strictly on the actual words of the county name (i.e. "aguascalientes", "asientos", etc. - and probably "force" if there are any spelling discrepancies)? Or a way I can identify the 21 counties that are in the cvegeo dataset and drop them so when I encode "mun" the numeric values are the same between sets (apart from manually)? Different government agencies use different ID systems for counties, but the cvegeo code is one of the most commonly used identifiers across datasets I'm looking at, so I expect similar issues to keep popping up.

    Thank you for your help!

    Best,
    Katie

    CVEGEO:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str77 mun long cvegeo
    "AGUASCALIENTES" 1001
    "ASIENTOS"       1002
    "CALVILLO"       1003
    "COSÍO"         1004
    "JESÚS MARÍA"  1005
    end
    AGRICULTURAL PRODUCTION:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int year byte idstate str22 state int idmun str51 mun byte idcycle str16 agseason byte idwater str8 water long idmeasure str14 measure long idcrop str33 crop float(sown harvest wrecked prod yield rurprice) double mexprodval
    2018 1 "Aguascalientes" 1 "Aguascalientes" 1 "Otoño-Invierno"  1 "Riego" 200201 "Tonelada"  5490000 "Avena forrajera en verde"     1258 1258 0   28861 22.94 647.73    18694136
    2018 1 "Aguascalientes" 1 "Aguascalientes" 1 "Otoño-Invierno"  1 "Riego" 200201 "Tonelada"  5900000 "Cebada forrajera en verde"      50   50 0    1389 27.78    600      833400
    2018 1 "Aguascalientes" 1 "Aguascalientes" 1 "Otoño-Invierno"  1 "Riego" 200201 "Tonelada"  9090000 "Triticale forrajero en verde"  235  235 0 7735.73 32.92 598.39   4628983.5
    2018 1 "Aguascalientes" 1 "Aguascalientes" 1 "Otoño-Invierno"  1 "Riego" 200201 "Tonelada" 15050000 "Pastos y praderas"             595  595 0 29809.5  50.1 594.48    17721152
    2018 1 "Aguascalientes" 1 "Aguascalientes" 2 "Primavera-Verano" 1 "Riego" 200201 "Tonelada"  5490000 "Avena forrajera en verde"       36   36 0   965.7 26.83 580.84 560917.1875
    end

  • #2
    I can think of a couple of ways to do this. I guess it really depends on how messy the names are in the agricultural production dataset.

    If you post the result of
    Code:
    contract state mun
    from the agricultural production dataset I would have a better sense, but for now this works with your example.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str77 mun long cvegeo
    "AGUASCALIENTES" 1001
    "ASIENTOS"       1002
    "CALVILLO"       1003
    "COSÍO"         1004
    "JESÚS MARÍA"  1005
    end
    
    gen state =  proper(mun)
    tempfile CVEGEO
    save `CVEGEO'
    
    
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int year byte idstate str22 state int idmun str51 mun byte idcycle str16 agseason byte idwater str8 water long idmeasure str14 measure long idcrop str33 crop float(sown harvest wrecked prod yield rurprice) double mexprodval
    2018 1 "Aguascalientes" 1 "Aguascalientes" 1 "Otoño-Invierno"  1 "Riego" 200201 "Tonelada"  5490000 "Avena forrajera en verde"     1258 1258 0   28861 22.94 647.73    18694136
    2018 1 "Aguascalientes" 1 "Aguascalientes" 1 "Otoño-Invierno"  1 "Riego" 200201 "Tonelada"  5900000 "Cebada forrajera en verde"      50   50 0    1389 27.78    600      833400
    2018 1 "Aguascalientes" 1 "Aguascalientes" 1 "Otoño-Invierno"  1 "Riego" 200201 "Tonelada"  9090000 "Triticale forrajero en verde"  235  235 0 7735.73 32.92 598.39   4628983.5
    2018 1 "Aguascalientes" 1 "Aguascalientes" 1 "Otoño-Invierno"  1 "Riego" 200201 "Tonelada" 15050000 "Pastos y praderas"             595  595 0 29809.5  50.1 594.48    17721152
    2018 1 "Aguascalientes" 1 "Aguascalientes" 2 "Primavera-Verano" 1 "Riego" 200201 "Tonelada"  5490000 "Avena forrajera en verde"       36   36 0   965.7 26.83 580.84 560917.1875
    end
    
    merge m:1 state using `CVEGEO'

    Comment


    • #3
      Hi! Thank you so much. Here are the results from "contract state mun":

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str22 state str51 mun int _freq
      "Aguascalientes"      "Aguascalientes"                490
      "Aguascalientes"      "Asientos"                      650
      "Aguascalientes"      "Calvillo"                      536
      "Aguascalientes"      "Cosío"                        428
      "Aguascalientes"      "El Llano"                      358
      "Aguascalientes"      "Jesús María"                 696
      "Aguascalientes"      "Pabellón de Arteaga"          503
      "Aguascalientes"      "Rincón de Romos"              580
      "Aguascalientes"      "San Francisco de Los Romo"     601
      "Aguascalientes"      "San José de Gracia"           158
      "Aguascalientes"      "Tepezalá"                     455
      "Baja California"     "Ensenada"                     2455
      "Baja California"     "Mexicali"                     3133
      "Baja California"     "Playas de Rosarito"            513
      "Baja California"     "Tecate"                        590
      "Baja California"     "Tijuana"                       498
      "Baja California Sur" "Comondú"                      817
      "Baja California Sur" "La Paz"                       1192
      "Baja California Sur" "Loreto"                         40
      "Baja California Sur" "Los Cabos"                     972
      "Baja California Sur" "Mulegé"                       737
      end

      I also realized that there are multiple municipalities with the same name but in different states I didn't account for (apologies!). So I tried to create a cvegeo code in the agriculture dataset based on the two "idmun" and "idstate" variables (which match what's used in the cvegeo codes but just separated/formatted differently) using these commands:

      Code:
      gen ids = "0" + string(idstate) 
      gen ism = "00" + string(idmun) 
      gen id = ids + ism
      destring id, gen(cvegeo)
      Which almost works but then i have the issue of too many zeros being added in the cvegeo codes I created at a certain point (i.e. I'm getting the same cvegeo code for "Aguascalientes" of 1001 with the code I tried, but 70015 instead of 7015 for "Cacahoatán").

      Agriculture data:
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input byte idstate str22 state int idmun str51 mun long cvegeo int _freq
      1 "Aguascalientes" 1 "Aguascalientes" 1001 490
      1 "Aguascalientes" 2 "Asientos"       1002 650
      1 "Aguascalientes" 3 "Calvillo"       1003 536
      1 "Aguascalientes" 4 "Cosío"         1004 428
      1 "Aguascalientes" 5 "Jesús María"  1005 696
      end
      Cvegeo data:
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str77 mun long cvegeo
      "AGUASCALIENTES" 1001
      "ASIENTOS"       1002
      "CALVILLO"       1003
      "COSÍO"         1004
      "JESÚS MARÍA"  1005
      end
      Agriculture data:
      Code:
      ***
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input byte idstate str22 state int idmun str51 mun long cvegeo int _freq
      7 "Chiapas" 15 "Cacahoatán"            70015  72
      7 "Chiapas" 16 "Catazajá"              70016 167
      7 "Chiapas" 17 "Cintalapa"              70017 276
      7 "Chiapas" 18 "Coapilla"               70018  84
      7 "Chiapas" 19 "Comitán de Domínguez" 70019 121
      7 "Chiapas" 20 "La Concordia"           70020 211
      end
      Cvegeo data:
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str77 mun long cvegeo
      "CACAHOATÁN"            7015
      "CATAZAJÁ"              7016
      "CINTALAPA"              7017
      "COAPILLA"               7018
      "COMITÁN DE DOMÍNGUEZ" 7019
      "LA CONCORDIA"           7020
      end
      The solution I was going to try next was, using the same code, just creating separate IDs for subsets of counties based on how many integers are in the state/mun id, and then combine into one cvegeo code after.


      Thank you for your input, Justin! Please let know what you think and any suggestions. And apologies again for missing that bit of information (sorry still learning!)

      Comment


      • #4
        Try this

        Code:
        use "AGRICULTURAL PRODUCTION.dta" , clear
        
        gen munfix = ustrto(ustrnormalize(mun, "nfd"), "ascii", 2)
        tempfile temp
        save `temp'
        
        use "CVEGEO.dta" , clear  
        
        gen munfix = proper(ustrto(ustrnormalize(mun, "nfd"), "ascii", 2))
        merge 1:m munfix using `temp'

        Comment


        • #5
          This worked - thank you so much again!

          Comment

          Working...
          X