Merging two datasets with same county names (missing counties in one set), different IDs

Katie Campbell

Join Date: Apr 2020

Posts: 4
#1

Merging two datasets with same county names (missing counties in one set), different IDs

08 Apr 2020, 16:38

Hi all, this is my first post here and also new to Stata so apologies if not clear!

I'm trying to merge the two datasets below - one lists just two variables (county name "mun" and county ID "cvegeo"), and the other is agricultural production data by county.

I want to assign the county ID variable "cvegeo" in the first dataset to the county name in the agricultural data set using the common county name variable, "mun".

At first, I tried encoding "mun" in both datasets and merging on that new numeric ID, but I am missing 21 counties from the agriculture dataset that are included in the cvegeo one (2305 vs. 2326), so the numeric values assigned to "mun" are different between sets.

Is it possible for me to merge based strictly on the actual words of the county name (i.e. "aguascalientes", "asientos", etc. - and probably "force" if there are any spelling discrepancies)? Or a way I can identify the 21 counties that are in the cvegeo dataset and drop them so when I encode "mun" the numeric values are the same between sets (apart from manually)? Different government agencies use different ID systems for counties, but the cvegeo code is one of the most commonly used identifiers across datasets I'm looking at, so I expect similar issues to keep popping up.

Thank you for your help!

Best,
Katie

CVEGEO:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str77 mun long cvegeo "AGUASCALIENTES" 1001 "ASIENTOS" 1002 "CALVILLO" 1003 "COSÍO" 1004 "JESÚS MARÍA" 1005 end

AGRICULTURAL PRODUCTION:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input int year byte idstate str22 state int idmun str51 mun byte idcycle str16 agseason byte idwater str8 water long idmeasure str14 measure long idcrop str33 crop float(sown harvest wrecked prod yield rurprice) double mexprodval 2018 1 "Aguascalientes" 1 "Aguascalientes" 1 "Otoño-Invierno" 1 "Riego" 200201 "Tonelada" 5490000 "Avena forrajera en verde" 1258 1258 0 28861 22.94 647.73 18694136 2018 1 "Aguascalientes" 1 "Aguascalientes" 1 "Otoño-Invierno" 1 "Riego" 200201 "Tonelada" 5900000 "Cebada forrajera en verde" 50 50 0 1389 27.78 600 833400 2018 1 "Aguascalientes" 1 "Aguascalientes" 1 "Otoño-Invierno" 1 "Riego" 200201 "Tonelada" 9090000 "Triticale forrajero en verde" 235 235 0 7735.73 32.92 598.39 4628983.5 2018 1 "Aguascalientes" 1 "Aguascalientes" 1 "Otoño-Invierno" 1 "Riego" 200201 "Tonelada" 15050000 "Pastos y praderas" 595 595 0 29809.5 50.1 594.48 17721152 2018 1 "Aguascalientes" 1 "Aguascalientes" 2 "Primavera-Verano" 1 "Riego" 200201 "Tonelada" 5490000 "Avena forrajera en verde" 36 36 0 965.7 26.83 580.84 560917.1875 end
Tags: None

Justin Niakamal

Join Date: Aug 2017
Posts: 757

08 Apr 2020, 17:45

I can think of a couple of ways to do this. I guess it really depends on how messy the names are in the agricultural production dataset.

If you post the result of

Code:

contract state mun

from the agricultural production dataset I would have a better sense, but for now this works with your example.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str77 mun long cvegeo
"AGUASCALIENTES" 1001
"ASIENTOS"       1002
"CALVILLO"       1003
"COSÍO"         1004
"JESÚS MARÍA"  1005
end

gen state =  proper(mun)
tempfile CVEGEO
save `CVEGEO'


* Example generated by -dataex-. To install: ssc install dataex
clear
input int year byte idstate str22 state int idmun str51 mun byte idcycle str16 agseason byte idwater str8 water long idmeasure str14 measure long idcrop str33 crop float(sown harvest wrecked prod yield rurprice) double mexprodval
2018 1 "Aguascalientes" 1 "Aguascalientes" 1 "Otoño-Invierno"  1 "Riego" 200201 "Tonelada"  5490000 "Avena forrajera en verde"     1258 1258 0   28861 22.94 647.73    18694136
2018 1 "Aguascalientes" 1 "Aguascalientes" 1 "Otoño-Invierno"  1 "Riego" 200201 "Tonelada"  5900000 "Cebada forrajera en verde"      50   50 0    1389 27.78    600      833400
2018 1 "Aguascalientes" 1 "Aguascalientes" 1 "Otoño-Invierno"  1 "Riego" 200201 "Tonelada"  9090000 "Triticale forrajero en verde"  235  235 0 7735.73 32.92 598.39   4628983.5
2018 1 "Aguascalientes" 1 "Aguascalientes" 1 "Otoño-Invierno"  1 "Riego" 200201 "Tonelada" 15050000 "Pastos y praderas"             595  595 0 29809.5  50.1 594.48    17721152
2018 1 "Aguascalientes" 1 "Aguascalientes" 2 "Primavera-Verano" 1 "Riego" 200201 "Tonelada"  5490000 "Avena forrajera en verde"       36   36 0   965.7 26.83 580.84 560917.1875
end

merge m:1 state using `CVEGEO'

Comment

Katie Campbell

Join Date: Apr 2020
Posts: 4

10 Apr 2020, 13:17

Hi! Thank you so much. Here are the results from "contract state mun":

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str22 state str51 mun int _freq
"Aguascalientes"      "Aguascalientes"                490
"Aguascalientes"      "Asientos"                      650
"Aguascalientes"      "Calvillo"                      536
"Aguascalientes"      "Cosío"                        428
"Aguascalientes"      "El Llano"                      358
"Aguascalientes"      "Jesús María"                 696
"Aguascalientes"      "Pabellón de Arteaga"          503
"Aguascalientes"      "Rincón de Romos"              580
"Aguascalientes"      "San Francisco de Los Romo"     601
"Aguascalientes"      "San José de Gracia"           158
"Aguascalientes"      "Tepezalá"                     455
"Baja California"     "Ensenada"                     2455
"Baja California"     "Mexicali"                     3133
"Baja California"     "Playas de Rosarito"            513
"Baja California"     "Tecate"                        590
"Baja California"     "Tijuana"                       498
"Baja California Sur" "Comondú"                      817
"Baja California Sur" "La Paz"                       1192
"Baja California Sur" "Loreto"                         40
"Baja California Sur" "Los Cabos"                     972
"Baja California Sur" "Mulegé"                       737
end

I also realized that there are multiple municipalities with the same name but in different states I didn't account for (apologies!). So I tried to create a cvegeo code in the agriculture dataset based on the two "idmun" and "idstate" variables (which match what's used in the cvegeo codes but just separated/formatted differently) using these commands:

Code:

gen ids = "0" + string(idstate) 
gen ism = "00" + string(idmun) 
gen id = ids + ism
destring id, gen(cvegeo)

Which almost works but then i have the issue of too many zeros being added in the cvegeo codes I created at a certain point (i.e. I'm getting the same cvegeo code for "Aguascalientes" of 1001 with the code I tried, but 70015 instead of 7015 for "Cacahoatán").

Agriculture data:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte idstate str22 state int idmun str51 mun long cvegeo int _freq
1 "Aguascalientes" 1 "Aguascalientes" 1001 490
1 "Aguascalientes" 2 "Asientos"       1002 650
1 "Aguascalientes" 3 "Calvillo"       1003 536
1 "Aguascalientes" 4 "Cosío"         1004 428
1 "Aguascalientes" 5 "Jesús María"  1005 696
end

Cvegeo data:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str77 mun long cvegeo
"AGUASCALIENTES" 1001
"ASIENTOS"       1002
"CALVILLO"       1003
"COSÍO"         1004
"JESÚS MARÍA"  1005
end

Agriculture data:

Code:

***
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte idstate str22 state int idmun str51 mun long cvegeo int _freq
7 "Chiapas" 15 "Cacahoatán"            70015  72
7 "Chiapas" 16 "Catazajá"              70016 167
7 "Chiapas" 17 "Cintalapa"              70017 276
7 "Chiapas" 18 "Coapilla"               70018  84
7 "Chiapas" 19 "Comitán de Domínguez" 70019 121
7 "Chiapas" 20 "La Concordia"           70020 211
end

Cvegeo data:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str77 mun long cvegeo
"CACAHOATÁN"            7015
"CATAZAJÁ"              7016
"CINTALAPA"              7017
"COAPILLA"               7018
"COMITÁN DE DOMÍNGUEZ" 7019
"LA CONCORDIA"           7020
end

The solution I was going to try next was, using the same code, just creating separate IDs for subsets of counties based on how many integers are in the state/mun id, and then combine into one cvegeo code after.

Thank you for your input, Justin! Please let know what you think and any suggestions. And apologies again for missing that bit of information (sorry still learning!)

Comment

Justin Niakamal

Join Date: Aug 2017
Posts: 757

13 Apr 2020, 09:14

Try this

Code:

use "AGRICULTURAL PRODUCTION.dta" , clear

gen munfix = ustrto(ustrnormalize(mun, "nfd"), "ascii", 2)
tempfile temp
save `temp'

use "CVEGEO.dta" , clear  

gen munfix = proper(ustrto(ustrnormalize(mun, "nfd"), "ascii", 2))
merge 1:m munfix using `temp'

Comment

Katie Campbell

Join Date: Apr 2020

Posts: 4
#5

15 Apr 2020, 17:16

This worked - thank you so much again!
Comment

Announcement