
  • New command pyconvertu in SSC

    I'd like to thank Professor Baum for uploading the command to SSC!

    The pyconvertu command converts a string variable into a classification from the default or a user-provided JSON file, with the help of Python 3.

    The default classification is ISO 3166-1 (ISO country codes and names), taken directly from https://www.iso.org/iso-3166-country-codes.html
    The innovation is the use of regular expressions rather than the direct lookup performed by the PyPI package pycountry (in the latter, shorter names such as "United States" instead of "United States of America", or trailing spaces after names, can cause problems). Python is well suited to this kind of task. No PyPI modules are required to run the command, only a basic Python 3 installation, say Anaconda 3 or Miniconda (on Windows); macOS and most Linux distributions have Python 3 pre-installed.
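    As a rough sketch of that idea (illustrative Python only, not pyconvertu's actual source; the patterns and field names below are invented), regex matching tolerates name variants and stray whitespace that would break an exact string lookup:
    ```python
    import re
    from typing import Optional

    # Illustrative only: invented patterns mimicking how a regex-based
    # classification can absorb name variants and trailing whitespace.
    CLASSIFICATION = [
        {"regex": r"^\s*united states( of america)?\s*$", "iso3": "USA"},
        {"regex": r"^\s*czech(ia| republic)?\s*$", "iso3": "CZE"},
    ]

    def to_iso3(name: str) -> Optional[str]:
        """Return the alpha-3 code of the first matching pattern, else None."""
        for entry in CLASSIFICATION:
            if re.match(entry["regex"], name, flags=re.IGNORECASE):
                return entry["iso3"]
        return None
    ```
    Both "United States" and "United States of America " then resolve to USA, whereas an exact-lookup approach would treat them as different keys.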

    Requirements:
    • Stata 16 or newer
    • a Python 3 (or newer) executable, set with the help of the python set exec command
    Usage:
    Code:
    . pyconvertu varname, to(string)
    performs the conversion of varname to the specified classification and must be followed by one of generate(string), replace, or print.

    Code:
    . pyconvertu __classification, to(string)
    returns the specified classification as a whole and must be followed by generate(string) or print.

    Code:
    . pyconvertu __info
    prints metadata and sources; no options are required.

    The command can easily transform English country names into ISO three-letter, two-letter, or numeric codes, or build the dimensions of a balanced panel dataset covering the whole world (per ISO's list of countries) in three lines of Stata code.
    The help file also provides instructions on how to construct one's own JSON file and use it instead of the default classification.

    Examples:
    Code:
            * write the complete default JSON file (ISO 3166-1) to data
            . foreach s in "iso3" "iso2" "isoN" "name_en" "name_fr" {
            .     pyconvertu __classification, to(`s') gen(`s')
            . }
    
            * print metadata and sources for the default JSON file
            . pyconvertu __info
    
            * generate panel dimensions (ISO 3166-1 alpha-3 codes for the years 2000-2020)
            . clear
            . pyconvertu __classification, to(iso3) gen(iso3)
            . expand `=(2020 - 2000) + 1'
            . by iso3, sort: gen year = 2000 + (_n - 1)
    
            * convert ISO 3166-1 alpha-3 to ISO 3166-1 numeric (where possible) in a dataset
            . ssc install wbopendata
            . sysuse world-d, clear
            . pyconvertu countrycode, to(isoN) replace
    
            * same example, print the result of conversion instead of writing it to data
            . sysuse world-d, clear
            . pyconvertu countrycode, to(isoN) print
    More advanced examples taken from the xtimportu command:
    Code:
            ****
            * Example 1. Population time series for the Czech Republic (a country in Central Europe, EU member since 2004)
            ****
    
            * RegEx for the indicator, case sensitive!
            * unoptimized, illustration only
            . local regex "Počet"
    
            * ČSÚ's (Czech Statistical Office) file URL for Population
            . local url "https://www.czso.cz/documents/10180/123502877/32018120_0101.xlsx/d60b89c8-980c-4f3a-bc0c-46f38b0b8681?version=1.0"
    
            * import the time series data to memory, unit: thousand
            . xtimportu excel "`url'", cellrange(A3) regex(`regex') encode("Czech Republic") tfreq(Y) tde clear
    
            * revert underscores to spaces in the unit
            . replace unit = ustrregexra(unit, "_", " ")
    
            * tsset data
            . tsset year
    
            * convert country name to ISO 3166-1 alpha-3
            . pyconvertu unit, to(iso3) replace
    
            ****
            * Example 2. FDI matrix from UNCTAD's Bilateral FDI statistics (historical data, 2000–2014)
            ****
    
            * RegEx for the EU-28, case sensitive! "\W{0,}$" (zero or more non-word characters at the end) excludes Netherlands Antilles
            * unoptimized, illustration only
            . local regex "`regex'Austria|Belgium|Bulgaria|Croatia|Cyprus|Czech Republic|Denmark|Estonia|"
            . local regex "`regex'Finland|France|Germany|Greece|Hungary|Ireland|Italy|Latvia|Lithuania|"
            . local regex "`regex'Luxembourg|Malta|Netherlands\W{0,}$|Poland|Portugal|Romania|"
            . local regex "`regex'Slovakia|Slovenia|Spain|Sweden|United Kingdom"
    
            * UNCTAD's (United Nations Conference on Trade and Development) file URL for the U.S.
            . local url "https://unctad.org/system/files/non-official-document/webdiaeia2014d3_USA.xls"
    
            * import the panel data to memory, export a copy as a CSV file
            . xtimportu excel "`url'", sheet("inflows") cellrange(E5) regex(`regex') tfreq(Y) clear tde export(delimited "./usa_fdi_matrix.csv", replace)
    
            * rename variables to form the 28x1 aka the EU-28 x U.S. FDI matrix, unit: million USD
            . rename unit from
            . rename value to_USA
    
            * xtset data
            . encode from, gen(id)
            . xtset id year
    
            * convert country names to ISO 3166-1 alpha-3
            . pyconvertu from, to(iso3) replace
    Last edited by Ilya Bolotov; 04 Jan 2021, 16:12.

  • #2
    If you find an incorrect conversion for ISO 3166-1, please post it here.
    I'll fix the RegEx in the JSON file.



    • #3
      Thanks to Professor Baum, the command has been updated.
      The RegEx were re-tested, a couple of imprecisions were corrected, and one partially recognized country was added. The JSON file now corresponds to the EU “standard”.

      You can now create your own JSON files from data in memory with
      Code:
      pyconvertu __dump, to(filepath)
      The data in memory should contain the subheadings “Data:”, “Metadata:”, and “Sources:”; see the enclosed Excel file.
      For example,
      Code:
      import excel "iso3166_classifications.xlsx", clear
      pyconvertu __dump, to(iso3166_classifications.json)
      pyconvertu __classification, from(iso3166_classifications.json) to(iso3) print
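      The thread does not show the JSON schema itself (it is documented in the command's help file), but as a guessed illustration in Python, a user-provided classification with the three sections mentioned above might be assembled like this (the key names here are assumptions, not pyconvertu's documented format):
      ```python
      import json

      # Guessed structure only: the real schema is defined by pyconvertu's
      # help file. The three sections mirror the "Data:", "Metadata:", and
      # "Sources:" subheadings mentioned in the post.
      classification = {
          "data": [
              {"regex": r"^\s*czech(ia| republic)?\s*$",
               "iso3": "CZE", "name_en": "Czechia"},
          ],
          "metadata": {"description": "Example user-provided classification"},
          "sources": ["https://www.iso.org/iso-3166-country-codes.html"],
      }

      # Write the file that the from() option would later point at.
      with open("my_classification.json", "w", encoding="utf-8") as f:
          json.dump(classification, f, ensure_ascii=False, indent=2)
      ```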
      Attached Files



      • #4
        I’ve been thinking of writing RegEx for subnational classifications (regional codes) for pyconvertu.
        It makes sense for big federal countries like the U.S., Mexico, Brazil, Russia, and India, and possibly China (although not a federation, the PRC has many provinces).

        The idea is to convert between ISO 3166-2 codes, local regional codes (such as postal or FIPS in the U.S.), names in the local languages, their numerous transliterations, and English/French/etc. names.
        For example, ISO code <-> internal Chinese classification <-> Simplified Chinese <-> Traditional Chinese <-> Pinyin <-> Other transliteration.
        Say you have plenty of official statistics in a local language for some 20–90 regions in an Excel file and you have some international statistics for the aforementioned regions in English.
        Believe me, you wouldn’t want to merge them by hand.
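        A minimal sketch of what such a subnational table could look like (the ISO 3166-2:CN codes below are real; the regex alternations covering Simplified Chinese, Traditional Chinese, and Pinyin are my own illustration, not part of pyconvertu):
        ```python
        import re

        # Illustration: several written forms of a Chinese region resolving
        # to one ISO 3166-2 code (Simplified, Traditional, Pinyin variants).
        PROVINCES = [
            {"regex": r"^(广东(省)?|廣東(省)?|guangdong)$", "iso_3166_2": "CN-GD"},
            {"regex": r"^(北京(市)?|beijing)$", "iso_3166_2": "CN-BJ"},
        ]

        def region_to_iso(name):
            """Map a region name in any listed script to its ISO 3166-2 code."""
            for entry in PROVINCES:
                if re.match(entry["regex"], name.strip(), flags=re.IGNORECASE):
                    return entry["iso_3166_2"]
            return None
        ```
        One regex entry per region then absorbs the script and transliteration variants, which is exactly what makes the merge-by-hand problem disappear.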

        In case someone is interested, please let me know.



        • #5
          Dear Ilya, I am very interested in this command, especially in converting between ISO codes and the internal Chinese classification, Simplified Chinese, or Pinyin.
          Best regards.

          Raymond Zhang
          Stata 17.0, MP



          • #6
            Originally posted by Raymond Zhang:
            Dear Ilya, I am very interested in this command, especially in converting between ISO codes and the internal Chinese classification, Simplified Chinese, or Pinyin.
            Dear Raymond,

            Certainly. Maybe you can send me a private message so we can discuss the details and post the JSON file here later?
            We can add any names and classifications you need. I strongly suspect PRC will be a popular topic, so preparing a decent JSON file is in many people’s interest.

            P.S. You can already install the pyconvertu command; simply type
            Code:
            ssc install pyconvertu



            • #7
              I've prepared an ISO 639-2 classification of languages for my pyconvertu command, based on what is available at the U.S. Library of Congress. If you are interested, you can download the attached Excel file.
              The included codes are iso2 (sparse), iso3 (T, full), name_en, name_fr, and name_de (German). The RegEx are configured for name_en, iso3 (both B and T), and iso2 (where available).
              You can, for example, use it to encode CEPII GeoDist data or to standardize lists of languages for quantitative humanities research.

              Code:
              import excel "iso0639_classifications.xlsx", clear
              drop if _n > 489
              pyconvertu __dump, to("iso0639-2_all")
              clear
              pyconvertu __classification, from(iso0639-2_all.json) to(iso2) gen(iso2)
              pyconvertu __classification, from(iso0639-2_all.json) to(iso3) gen(iso3)
              pyconvertu __classification, from(iso0639-2_all.json) to(name_en) gen(name_en)
              I tested the RegEx on the Excel file itself; it worked well but may not yet be 100% bug-proof.
              Attached Files



              • #8
                I've compiled one more classification, ISO 4217 (currency codes), and re-tested the existing ones to remove bugs in the regular expressions.
                So far, after a standardization of sorts, the following classifications are available:
                • ISO 3166-1: two-letter, three-letter, and numeric country codes with names in English and French (just above 200 entries); this is the default classification that ships with the pyconvertu command on SSC.
                • ISO 4217 (April 2022): three-letter and numeric currency codes with names in English and French, and numeric units (just below 200 entries);
                • ISO 639-2: two- and three-letter language codes with names in English and French (just under 500 entries).
                After some consideration, I've decided to post the files here so that an interested enthusiast can extend the translation to other languages such as Mandarin, Hindi, Arabic, etc.
                Could you please post the result(s) here if you do so? Simply add columns following the logic of the enclosed Excel files. Your contribution is highly appreciated!

                iso3166_classifications.xlsx
                iso4217_classifications.xlsx
                iso0639_classifications.xlsx

                How do you construct usable JSON files for pyconvertu with the from() option? It's now straightforward:
                Code:
                        import excel <EXCEL FILE>, clear
                        pyconvertu __dump, to(<JSON FILE>)
                How does it all work?
                Using regular expressions, pyconvertu can convert the English names of countries, currencies, and languages into ISO codes and vice versa. It can also translate these names into the other available languages!
                For example,
                Code:
                       pyconvertu __classification, to(name_en) gen(country) // will create a list of English country names
                       pyconvertu country, to(iso3) gen(iso3) // will convert the country names into ISO 3166 alpha-3 codes
                       pyconvertu iso3, to(name_en) gen(name_en) // will reverse the last operation
                       pyconvertu name_en, to(name_fr) gen(name_fr) // will translate 200+ English country names into French just like that! Or to your language if you extend the existing Excel files.
                What is it good for?
                Merging cross-sectional or panel data of any kind. Suppose you have ISO country codes in one dataset (say, from the IMF) and country names in another (say, from UNCTAD).
                pyconvertu will solve the problem for you in one line of code.
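                To make that concrete, here is the merge problem in miniature, sketched in plain Python with made-up numbers (pyconvertu does the standardization step inside Stata; this only illustrates the idea, and the regex table is invented):
                ```python
                import re

                # Two toy tables with made-up values: one keyed by ISO alpha-3
                # codes (as in IMF data), the other by English country names
                # (as in UNCTAD data).
                by_code = {"CZE": 1.0, "USA": 2.0}
                by_name = {"Czech Republic": 3.0, "United States": 4.0}

                # Invented regex table standing in for the classification file.
                NAME_TO_ISO3 = {
                    r"^czech(ia| republic)?$": "CZE",
                    r"^united states( of america)?$": "USA",
                }

                def standardize(name):
                    """Convert an English country name to its alpha-3 code."""
                    for pattern, code in NAME_TO_ISO3.items():
                        if re.match(pattern, name.strip(), flags=re.IGNORECASE):
                            return code
                    return None

                # Once both tables share the same key, the merge is trivial.
                merged = {standardize(n): (by_code[standardize(n)], v)
                          for n, v in by_name.items()}
                ```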

