Hello,
I am trying to merge 5 different datasets containing information about each subnational entity (region) of the world. Each file uses different identifiers:
A) 1 identified by country's isocode, country's name and region's name
B) 1 identified by country's isocode, country's name and a region's isocode
C) 1 identified by country's name and region's name
D) 1 identified by country's isocode and region's name
E) 1 identified by region's name
My first intuition was to merge the datasets that contain the region's name (it is a common variable in 4 of the 5 files), but I am struggling to find code that tolerates discrepancies in the region's name across datasets (something like Levenshtein distance, but calculated across different datasets). I would like to know whether this is feasible in Stata or whether I should look to other tools, such as Python, to do this fuzzy matching.
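To make the question concrete, here is a minimal sketch in Python of the kind of matching I have in mind. It is only an illustration under my own assumptions: the sample rows are made up, the function names (`levenshtein`, `best_match`, `fuzzy_merge`) are hypothetical, and the distance threshold of 2 is arbitrary. It compares region names only within the same country isocode, so it would apply to files like A and D.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(name.lower().split())


def best_match(name, candidates, max_distance=2):
    """Closest candidate to `name`, or None if none is within max_distance."""
    target = normalize(name)
    best, best_d = None, max_distance + 1
    for cand in candidates:
        d = levenshtein(target, normalize(cand))
        if d < best_d:
            best, best_d = cand, d
    return best


def fuzzy_merge(left, right, max_distance=2):
    """Match each (iso, region) in `left` to the closest region in `right`
    that shares the same country isocode."""
    by_country = {}
    for iso, region in right:
        by_country.setdefault(iso, []).append(region)
    return [(iso, region, best_match(region, by_country.get(iso, []), max_distance))
            for iso, region in left]


# Hypothetical sample rows: (country isocode, region name)
file_a = [("FRA", "Bretagne"), ("ESP", "Cataluña"), ("ITA", "Lombardia")]
file_d = [("FRA", "Bretagn"), ("ESP", "Cataluna"), ("ITA", "Lombardy")]

matches = fuzzy_merge(file_a, file_d)
for row in matches:
    print(row)
```

Restricting the comparison to regions of the same country keeps the number of pairwise comparisons small and avoids matching similarly named regions from different countries. File E, which has only the region's name, would need an unrestricted comparison.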
Thanks!