Hi all, this is my first post here and also new to Stata so apologies if not clear!
I'm trying to merge the two datasets below - one lists just two variables (county name "mun" and county ID "cvegeo"), and the other is agricultural production data by county.
I want to assign the county ID variable "cvegeo" in the first dataset to the county name in the agricultural data set using the common county name variable, "mun".
At first, I tried encoding "mun" in both datasets and merging on that new numeric ID, but I am missing 21 counties from the agriculture dataset that are included in the cvegeo one (2305 vs. 2326), so the numeric values assigned to "mun" are different between sets.
Is it possible for me to merge based strictly on the actual words of the county name (i.e. "aguascalientes", "asientos", etc. - and probably "force" if there are any spelling discrepancies)? Or a way I can identify the 21 counties that are in the cvegeo dataset and drop them so when I encode "mun" the numeric values are the same between sets (apart from manually)? Different government agencies use different ID systems for counties, but the cvegeo code is one of the most commonly used identifiers across datasets I'm looking at, so I expect similar issues to keep popping up.
Thank you for your help!
Best,
Katie
CVEGEO:
AGRICULTURAL PRODUCTION:
I'm trying to merge the two datasets below - one lists just two variables (county name "mun" and county ID "cvegeo"), and the other is agricultural production data by county.
I want to assign the county ID variable "cvegeo" in the first dataset to the county name in the agricultural data set using the common county name variable, "mun".
At first, I tried encoding "mun" in both datasets and merging on that new numeric ID, but I am missing 21 counties from the agriculture dataset that are included in the cvegeo one (2305 vs. 2326), so the numeric values assigned to "mun" are different between sets.
Is it possible for me to merge based strictly on the actual words of the county name (i.e. "aguascalientes", "asientos", etc. - and probably "force" if there are any spelling discrepancies)? Or a way I can identify the 21 counties that are in the cvegeo dataset and drop them so when I encode "mun" the numeric values are the same between sets (apart from manually)? Different government agencies use different ID systems for counties, but the cvegeo code is one of the most commonly used identifiers across datasets I'm looking at, so I expect similar issues to keep popping up.
Thank you for your help!
Best,
Katie
CVEGEO:
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input str77 mun long cvegeo "AGUASCALIENTES" 1001 "ASIENTOS" 1002 "CALVILLO" 1003 "COSÍO" 1004 "JESÚS MARÍA" 1005 end
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input int year byte idstate str22 state int idmun str51 mun byte idcycle str16 agseason byte idwater str8 water long idmeasure str14 measure long idcrop str33 crop float(sown harvest wrecked prod yield rurprice) double mexprodval 2018 1 "Aguascalientes" 1 "Aguascalientes" 1 "Otoño-Invierno" 1 "Riego" 200201 "Tonelada" 5490000 "Avena forrajera en verde" 1258 1258 0 28861 22.94 647.73 18694136 2018 1 "Aguascalientes" 1 "Aguascalientes" 1 "Otoño-Invierno" 1 "Riego" 200201 "Tonelada" 5900000 "Cebada forrajera en verde" 50 50 0 1389 27.78 600 833400 2018 1 "Aguascalientes" 1 "Aguascalientes" 1 "Otoño-Invierno" 1 "Riego" 200201 "Tonelada" 9090000 "Triticale forrajero en verde" 235 235 0 7735.73 32.92 598.39 4628983.5 2018 1 "Aguascalientes" 1 "Aguascalientes" 1 "Otoño-Invierno" 1 "Riego" 200201 "Tonelada" 15050000 "Pastos y praderas" 595 595 0 29809.5 50.1 594.48 17721152 2018 1 "Aguascalientes" 1 "Aguascalientes" 2 "Primavera-Verano" 1 "Riego" 200201 "Tonelada" 5490000 "Avena forrajera en verde" 36 36 0 965.7 26.83 580.84 560917.1875 end