I'd like to thank Professor Baum for uploading the command to SSC!
The pyconvertu command converts a string variable into a classification from the default or a user-provided JSON file with the help of Python 3.
The default classification is ISO 3166-1 (ISO country codes and names), taken directly from https://www.iso.org/iso-3166-country-codes.html.
The innovation is the use of regular expressions instead of the direct lookup performed by PyPI's pycountry (where, for example, shorter names like United States instead of United States of America, or trailing spaces after names, may cause problems). Python seems ideal for this kind of task. No PyPI modules are required to run the command, only a basic Python 3 installation, say Anaconda 3 or Miniconda (on Windows); macOS and most Linux distributions have Python 3 pre-installed.
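To make the regex idea concrete, here is a minimal Python sketch of the matching approach described above. This is an illustration only, not pyconvertu's actual lookup table or code: the pattern strings and the `to_iso3` helper are assumptions, chosen to show why regex matching tolerates name variants and stray whitespace where an exact-string lookup would fail.

```python
import re

# Hypothetical classification table (illustration only, not pyconvertu's):
# each entry pairs a regex with an ISO 3166-1 alpha-3 code, so that name
# variants and surrounding whitespace still match.
CLASSIFICATION = [
    {"regex": r"^\s*united states( of america)?\s*$", "iso3": "USA"},
    {"regex": r"^\s*czech(ia| republic)\s*$", "iso3": "CZE"},
]

def to_iso3(name: str) -> str:
    """Return the alpha-3 code for the first matching pattern, else ''."""
    for entry in CLASSIFICATION:
        if re.match(entry["regex"], name, flags=re.IGNORECASE):
            return entry["iso3"]
    return ""

print(to_iso3("United States"))    # shorter variant still matches
print(to_iso3("Czech Republic "))  # trailing space still matches
```

An exact-string dictionary keyed on "United States of America" would miss both of these inputs, which is the failure mode of direct conversion noted above.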
Requirements:
- Stata 16 or newer
- an executable of a Python installation (Python 3 or newer), set with the help of the python set exec command

Syntax:

Code:
. pyconvertu varname, to(string)
performs the conversion of varname to the specified classification and may be followed by generate(string), replace, or print.

Code:
. pyconvertu __classification, to(string)
returns the specified classification as a whole and may be followed by generate(string) or print.

Code:
. pyconvertu __info
prints metadata and sources; no options are required.

The command can easily transform country names in English into ISO 3166-1 alpha-3, alpha-2, or numeric codes, or build the dimensions of a balanced panel dataset for the whole world (as per ISO's list of countries) in three lines of Stata code.
The help file also provides instructions on how to construct one's own JSON file and use it instead of the default classification.

Examples:
Code:
* write the complete default JSON file (ISO 3166-1) to data
. foreach s in "iso3" "iso2" "isoN" "name_en" "name_fr" {
.     pyconvertu __classification, to(`s') gen(`s')
. }

* print metadata and sources for the default JSON file
. pyconvertu __info

* generate panel dimensions (ISO 3166-1 alpha-3 codes for the years 2000-2020)
. clear
. pyconvertu __classification, to(iso3) gen(iso3)
. expand `=(2020 - 2000) + 1'
. by iso3, sort: gen year = 2000 + (_n - 1)

* convert ISO 3166-1 alpha-3 to ISO 3166-1 numeric (where possible) in a dataset
. ssc install wbopendata
. sysuse world-d, clear
. pyconvertu countrycode, to(isoN) replace

* same example, print the result of conversion instead of writing it to data
. sysuse world-d, clear
. pyconvertu countrycode, to(isoN) print
More advanced examples, taken from the xtimportu command:

Code:
****
* Example 1. Population time series for the Czech Republic (a country in Central Europe, EU member since 2004)
****

* RegEx for the indicator, case sensitive!
* unoptimized, illustration only
. local regex "Počet"

* ČSÚ's (Czech Statistical Office) file URL for Population
. local url "https://www.czso.cz/documents/10180/123502877/32018120_0101.xlsx/d60b89c8-980c-4f3a-bc0c-46f38b0b8681?version=1.0"

* import the time series data to memory, unit: thousand
. xtimportu excel "`url'", cellrange(A3) regex(`regex') encode("Czech Republic") tfreq(Y) tde clear

* revert underscores to spaces in the unit
. replace unit = ustrregexra(unit, "_", " ")

* tsset data
. tsset year

* convert country name to ISO 3166-1 alpha-3
. pyconvertu unit, to(iso3) replace

****
* Example 2. FDI matrix from UNCTAD's Bilateral FDI statistics (historical data, 2000–2014)
****

* RegEx for the EU-28, case sensitive! "\W{0,}$" (0 or more non-word characters) excludes Netherlands Antilles
* unoptimized, illustration only
. local regex "`regex'Austria|Belgium|Bulgaria|Croatia|Cyprus|Czech Republic|Denmark|Estonia|"
. local regex "`regex'Finland|France|Germany|Greece|Hungary|Ireland|Italy|Latvia|Lithuania|"
. local regex "`regex'Luxembourg|Malta|Netherlands\W{0,}$|Poland|Portugal|Romania|"
. local regex "`regex'Slovakia|Slovenia|Spain|Sweden|United Kingdom"

* UNCTAD's (United Nations Conference on Trade and Development) file URL for the U.S.
. local url "https://unctad.org/system/files/non-official-document/webdiaeia2014d3_USA.xls"

* import the panel data to memory, export a copy as a CSV file
. xtimportu excel "`url'", sheet("inflows") cellrange(E5) regex(`regex') tfreq(Y) clear tde export(delimited "./usa_fdi_matrix.csv", replace)

* rename variables to form the 28x1 (EU-28 x U.S.) FDI matrix, unit: million USD
. rename unit from
. rename value to_USA

* xtset data
. encode from, gen(id)
. xtset id year

* convert country names to ISO 3166-1 alpha-3
. pyconvertu from, to(iso3) replace
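The help file documents the exact schema for user-provided JSON classification files; the Python sketch below only illustrates the general idea of such a file. The keys (`regex`, `iso3`, `iso2`, `isoN`) and the `convert` helper are assumptions for illustration, not the documented schema — consult the help file for the real layout before building your own file.

```python
import json
import os
import re
import tempfile

# Hypothetical user-provided classification (illustration only; the actual
# schema is described in pyconvertu's help file): each entry pairs a regex
# with the target codes.
classification = [
    {"regex": r"^\s*germany\s*$", "iso3": "DEU", "iso2": "DE", "isoN": "276"},
    {"regex": r"^\s*france\s*$",  "iso3": "FRA", "iso2": "FR", "isoN": "250"},
]

# Write the classification to a JSON file, as a user would before passing
# their own file to the command.
path = os.path.join(tempfile.mkdtemp(), "my_classification.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(classification, f, ensure_ascii=False, indent=2)

# Reload the file and convert a name to the requested key, mimicking to()
with open(path, encoding="utf-8") as f:
    table = json.load(f)

def convert(name: str, to: str) -> str:
    """Return the requested code for the first matching pattern, else ''."""
    for entry in table:
        if re.match(entry["regex"], name, flags=re.IGNORECASE):
            return entry[to]
    return ""

print(convert("Germany", "iso2"))  # DE
```

The round trip through JSON is the point: once the patterns and codes live in a plain file, the same conversion logic works for any classification you define, which is what substituting your own file for the default one amounts to.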