Merge datasets with multiple identifiers and a common one

Exequiel Caceres

Join Date: Jan 2023

Posts: 5
#1

22 Jan 2024, 14:54

Hello,

I am trying to merge 5 different datasets with information about each subnational entity of the world (region). The identifiers by file are as follows:

A) one identified by the country's ISO code, the country's name and the region's name
B) one identified by the country's ISO code, the country's name and the region's ISO code
C) one identified by the country's name and the region's name
D) one identified by the country's ISO code and the region's name
E) one identified by the region's name only

My first intuition was to merge the datasets that contain the region's name (as it is a common variable in 4 of the 5 datasets), but I am struggling to find code that tolerates discrepancies in the region's name across datasets (something like Levenshtein distance, but calculated across different datasets). I want to know if this is feasible in Stata or if I should look toward other programs like Python to do this fuzzy matching.

Thanks!
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

22 Jan 2024, 15:05

Before you resort to fuzzy matching, look into Rafal Raciborski's -kountry- package, which translates among a variety of country and region codes and naming systems. It may resolve your problem by enabling you to create a uniform set of identifier variables across all five data sets.

If not, for fuzzy matching, most people would recommend either Julio Raffo's -matchit- package or Michael Blasnik's -reclink- package.

All of these packages are available from SSC.
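
For comparison, if the Python route mentioned in #1 turns out to be necessary, the basic idea of matching region names across two files by Levenshtein distance can be sketched with the standard library alone. This is only a minimal sketch and not how -matchit- or -reclink- work internally: the function names, the sample region lists, and the cutoff of 2 edits are all hypothetical choices made for illustration.

```python
from typing import Optional, Sequence, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (row-by-row dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution
            ))
        previous = current
    return previous[-1]


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial differences cost nothing."""
    return " ".join(name.casefold().split())


def best_match(
    name: str, candidates: Sequence[str], max_distance: int = 2
) -> Optional[Tuple[str, int]]:
    """Return (closest candidate, distance), or None if nothing is close enough.

    max_distance is an assumed cutoff; it would need tuning on real data.
    """
    target = normalize(name)
    scored = [(levenshtein(target, normalize(c)), c) for c in candidates]
    if not scored:
        return None
    distance, candidate = min(scored)
    return (candidate, distance) if distance <= max_distance else None


# Hypothetical region names standing in for two of the five datasets.
regions_a = ["Bavaria", "Catalonia", "Ontario"]
regions_c = ["Bavria", "Cataluna", "Ontario", "Quebec"]

matches = {name: best_match(name, regions_c) for name in regions_a}

for name, match in matches.items():
    print(name, "->", match)
```

Where a country identifier is available in both files, it would probably be safer to compare region names only within the same country, since short region names from different countries can easily fall within a small edit distance of each other.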