  • Merge datasets with multiple identifiers and a common one

    Hello,

    I am trying to merge 5 different datasets, each with information about the subnational entities (regions) of the world. The identifiers in each file are as follows:

    A) 1 identified by country's isocode, country's name and region's name
    B) 1 identified by country's isocode, country's name and a region's isocode
    C) 1 identified by country's name and region's name
    D) 1 identified by country's isocode and region's name
    E) 1 identified by region's name

    My first intuition was to merge the datasets that contain the region's name (it is common to 4 of the 5 datasets), but I am struggling to find code that tolerates discrepancies in region names across datasets (something like Levenshtein distance, but calculated across different datasets). I would like to know whether this is feasible in Stata or whether I should look to other tools, such as Python, for this fuzzy matching.

    Thanks!

  • #2
    Before you resort to fuzzy matching, look into Rafal Raciborski's -kountry- package, which translates among a variety of country and region codes and naming systems. It may resolve your problem by enabling you to create a uniform set of identifier variables across all five data sets.

    If not, for fuzzy matching, most people would recommend either Julio Raffo's -matchit- package or Michael Blasnik's -reclink- package.

    All of these packages are available from SSC.
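
Since the question also mentions Python, here is a minimal sketch of what fuzzy matching of region names across two files could look like there. It is only an illustration under stated assumptions: the rows in `file_a` and `file_d` are made-up examples, the helper names and the 0.8 threshold are hypothetical, and the similarity score comes from the standard library's `difflib.SequenceMatcher`, which is a similarity ratio and not Levenshtein distance. Candidates are restricted to the same country isocode so that similarly named regions in different countries are not confused.

```python
from difflib import SequenceMatcher

# Hypothetical rows: (country isocode, region name). Not real data.
file_a = [("DEU", "Bavaria"), ("DEU", "Saxony"), ("FRA", "Bretagne")]
file_d = [("DEU", "Bavria"), ("DEU", "Saxony "), ("FRA", "Bretagne"), ("FRA", "Normandie")]


def normalize(name):
    """Lower-case the name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def best_match(name, candidates, threshold=0.8):
    """Return (candidate, score) for the most similar candidate.

    The candidate is None when no score reaches the threshold.
    The threshold of 0.8 is an arbitrary choice for this sketch.
    """
    best, best_score = None, 0.0
    for cand in candidates:
        score = SequenceMatcher(None, normalize(name), normalize(cand)).ratio()
        if score > best_score:
            best, best_score = cand, score
    if best_score >= threshold:
        return best, best_score
    return None, best_score


def match_within_country(left, right, threshold=0.8):
    """Fuzzy-match region names, comparing only rows with the same isocode."""
    by_country = {}
    for iso, region in right:
        by_country.setdefault(iso, []).append(region)
    results = []
    for iso, region in left:
        match, score = best_match(region, by_country.get(iso, []), threshold)
        results.append((iso, region, match, round(score, 2)))
    return results


matches = match_within_country(file_a, file_d)
for row in matches:
    print(row)
```

Any match found this way should still be reviewed by hand, and the threshold tuned, before the matched names are used as merge keys.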