
  • New command pyconvertu in SSC

    I'd like to thank Professor Baum for uploading the command to SSC!

    The pyconvertu command converts a string variable into a classification from the default or a user-provided JSON file, with the help of Python 3.

    The default classification is ISO 3166-1 (ISO country codes and names), taken directly from https://www.iso.org/iso-3166-country-codes.html
    The innovation is the use of regular expressions rather than the direct lookup performed by the PyPI package pycountry (in the latter, shorter names such as "United States" instead of "United States of America", or trailing spaces after names, can cause problems). Python is well suited to this kind of task. No PyPI modules are required to run the command, only a basic Python 3 installation, say Anaconda 3 or Miniconda (on Windows); macOS and most Linux distributions have Python 3 pre-installed.
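    As a rough sketch of that idea (illustrative Python only, not pyconvertu's actual source; the patterns and field names below are invented), regex matching tolerates name variants and stray whitespace that would break an exact string lookup:
    ```python
    import re
    from typing import Optional

    # Illustrative only: invented patterns mimicking how a regex-based
    # classification can absorb name variants and trailing whitespace.
    CLASSIFICATION = [
        {"regex": r"^\s*united states( of america)?\s*$", "iso3": "USA"},
        {"regex": r"^\s*czech(ia| republic)?\s*$", "iso3": "CZE"},
    ]

    def to_iso3(name: str) -> Optional[str]:
        """Return the alpha-3 code of the first matching pattern, else None."""
        for entry in CLASSIFICATION:
            if re.match(entry["regex"], name, flags=re.IGNORECASE):
                return entry["iso3"]
        return None
    ```
    Both "United States" and "United States of America " then resolve to USA, whereas an exact-lookup approach would treat them as different keys.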

    Requirements:
    • Stata 16 or newer
    • a Python 3 (or newer) executable, set with the help of the python set exec command
    Usage:
    Code:
    . pyconvertu varname, to(string)
    performs the conversion of varname to the specified classification and must be followed by one of generate(string), replace, or print.

    Code:
    . pyconvertu __classification, to(string)
    returns the specified classification as a whole and must be followed by generate(string) or print.

    Code:
    . pyconvertu __info
    prints metadata and sources; no options are required.

    The command can easily transform English country names into ISO three-letter, two-letter, or numeric codes, or build the dimensions of a balanced panel dataset covering the whole world (per ISO's list of countries) in three lines of Stata code.
    The help file also provides instructions on how to construct one's own JSON file and use it instead of the default classification.

    Examples:
    Code:
            * write the complete default JSON file (ISO 3166-1) to data
            . foreach s in "iso3" "iso2" "isoN" "name_en" "name_fr" {
            .     pyconvertu __classification, to(`s') gen(`s')
            . }
    
            * print metadata and sources for the default JSON file
            . pyconvertu __info
    
            * generate panel dimensions (ISO 3166-1 alpha-3 codes for the years 2000-2020)
            . clear
            . pyconvertu __classification, to(iso3) gen(iso3)
            . expand `=(2020 - 2000) + 1'
            . by iso3, sort: gen year = 2000 + (_n - 1)
    
            * convert ISO 3166-1 alpha-3 to ISO 3166-1 numeric (where possible) in a dataset
            . ssc install wbopendata
            . sysuse world-d, clear
            . pyconvertu countrycode, to(isoN) replace
    
            * same example, print the result of conversion instead of writing it to data
            . sysuse world-d, clear
            . pyconvertu countrycode, to(isoN) print
    More advanced examples taken from the xtimportu command:
    Code:
            ****
            * Example 1. Population time series for the Czech Republic (a country in Central Europe, EU member since 2004)
            ****
    
            * RegEx for the indicator, case sensitive!
            * unoptimized, illustration only
            . local regex "Počet"
    
            * ČSÚ's (Czech Statistical Office) file URL for Population
            . local url "https://www.czso.cz/documents/10180/123502877/32018120_0101.xlsx/d60b89c8-980c-4f3a-bc0c-46f38b0b8681?version=1.0"
    
            * import the time series data to memory, unit: thousand
            . xtimportu excel "`url'", cellrange(A3) regex(`regex') encode("Czech Republic") tfreq(Y) tde clear
    
            * revert underscores to spaces in the unit
            . replace unit = ustrregexra(unit, "_", " ")
    
            * tsset data
            . tsset year
    
            * convert country name to ISO 3166-1 alpha-3
            . pyconvertu unit, to(iso3) replace
    
            ****
            * Example 2. FDI matrix from UNCTAD's Bilateral FDI statistics (historical data, 2000–2014)
            ****
    
            * RegEx for the EU-28, case sensitive! "\W{0,}$" (zero or more non-word characters at the end) excludes Netherlands Antilles
            * unoptimized, illustration only
            . local regex "`regex'Austria|Belgium|Bulgaria|Croatia|Cyprus|Czech Republic|Denmark|Estonia|"
            . local regex "`regex'Finland|France|Germany|Greece|Hungary|Ireland|Italy|Latvia|Lithuania|"
            . local regex "`regex'Luxembourg|Malta|Netherlands\W{0,}$|Poland|Portugal|Romania|"
            . local regex "`regex'Slovakia|Slovenia|Spain|Sweden|United Kingdom"
    
            * UNCTAD's (United Nations Conference on Trade and Development) file URL for the U.S.
            . local url "https://unctad.org/system/files/non-official-document/webdiaeia2014d3_USA.xls"
    
            * import the panel data to memory, export a copy as a CSV file
            . xtimportu excel "`url'", sheet("inflows") cellrange(E5) regex(`regex') tfreq(Y) clear tde export(delimited "./usa_fdi_matrix.csv", replace)
    
            * rename variables to form the 28x1 aka the EU-28 x U.S. FDI matrix, unit: million USD
            . rename unit from
            . rename value to_USA
    
            * xtset data
            . encode from, gen(id)
            . xtset id year
    
            * convert country names to ISO 3166-1 alpha-3
            . pyconvertu from, to(iso3) replace
    Last edited by Ilya Bolotov; 04 Jan 2021, 16:12.

  • #2
    If you find an incorrect conversion for ISO 3166-1, please post it here.
    I'll fix the RegEx in the JSON file.



    • #3
      Thanks to Professor Baum, the command has been updated.
      The RegEx were re-tested, a couple of imprecisions were corrected, and one partially recognized country was added. The JSON file now corresponds to the EU “standard”.

      You can now create your own JSON files from data in memory with
      Code:
      pyconvertu __dump, to(filepath)
      The data in memory should contain the subheadings “Data:”, “Metadata:”, and “Sources:”; see the enclosed Excel file.
      For example,
      Code:
      import excel "iso3166_classifications.xlsx", clear
      pyconvertu __dump, to(iso3166_classifications.json)
      pyconvertu __classification, from(iso3166_classifications.json) to(iso3) print
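      The thread does not show the JSON schema itself (it is documented in the command's help file), but as a guessed illustration in Python, a user-provided classification with the three sections mentioned above might be assembled like this (the key names here are assumptions, not pyconvertu's documented format):
      ```python
      import json

      # Guessed structure only: the real schema is defined by pyconvertu's
      # help file. The three sections mirror the "Data:", "Metadata:", and
      # "Sources:" subheadings mentioned in the post.
      classification = {
          "data": [
              {"regex": r"^\s*czech(ia| republic)?\s*$",
               "iso3": "CZE", "name_en": "Czechia"},
          ],
          "metadata": {"description": "Example user-provided classification"},
          "sources": ["https://www.iso.org/iso-3166-country-codes.html"],
      }

      # Write the file that the from() option would later point at.
      with open("my_classification.json", "w", encoding="utf-8") as f:
          json.dump(classification, f, ensure_ascii=False, indent=2)
      ```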
      Attached Files



      • #4
        I’ve been thinking of writing RegEx for subnational classifications (regional codes) for pyconvertu.
        It makes sense for big federal countries like the U.S., Mexico, Brazil, Russia, and India, and possibly China (although not a federation, the PRC has many provinces).

        The idea is to convert between ISO 3166-2 codes, local regional codes (such as postal or FIPS in the U.S.), names in the local languages, their numerous transliterations, and English/French/etc. names.
        For example, ISO code <-> internal Chinese classification <-> Simplified Chinese <-> Traditional Chinese <-> Pinyin <-> Other transliteration.
        Say you have plenty of official statistics in a local language for some 20–90 regions in an Excel file and you have some international statistics for the aforementioned regions in English.
        Believe me, you wouldn’t want to merge them by hand.
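        A minimal sketch of what such a subnational table could look like (the ISO 3166-2:CN codes below are real; the regex alternations covering Simplified Chinese, Traditional Chinese, and Pinyin are my own illustration, not part of pyconvertu):
        ```python
        import re

        # Illustration: several written forms of a Chinese region resolving
        # to one ISO 3166-2 code (Simplified, Traditional, Pinyin variants).
        PROVINCES = [
            {"regex": r"^(广东(省)?|廣東(省)?|guangdong)$", "iso_3166_2": "CN-GD"},
            {"regex": r"^(北京(市)?|beijing)$", "iso_3166_2": "CN-BJ"},
        ]

        def region_to_iso(name):
            """Map a region name in any listed script to its ISO 3166-2 code."""
            for entry in PROVINCES:
                if re.match(entry["regex"], name.strip(), flags=re.IGNORECASE):
                    return entry["iso_3166_2"]
            return None
        ```
        One regex entry per region then absorbs the script and transliteration variants, which is exactly what makes the merge-by-hand problem disappear.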

        In case someone is interested, please let me know.



        • #5
          Dear Ilya, I am very interested in this command, especially in converting between ISO codes and the internal Chinese classification, Simplified Chinese, or Pinyin.
          Best regards.

          Raymond Zhang
          Stata 17.0, MP



          • #6
            Originally posted by Raymond Zhang:
            Dear Ilya, I am very interested in this command, especially in converting between ISO codes and the internal Chinese classification, Simplified Chinese, or Pinyin.
            Dear Raymond,

            Certainly. Maybe you can send me a private message so we can discuss the details and post the JSON file here later?
            We can add any names and classifications you need. I strongly suspect PRC will be a popular topic, so preparing a decent JSON file is in many people’s interest.

            P.S. You can already install the pyconvertu command; simply type
            Code:
            ssc install pyconvertu



            • #7
              I've prepared an ISO 639-2 classification of languages for my pyconvertu command, based on what is available at the U.S. Library of Congress. If you are interested, you can download the attached Excel file.
              The included codes are iso2 (sparse), iso3 (T, full), name_en, name_fr, and name_de (German). The RegEx are configured for name_en, iso3 (both B and T), and iso2 (where available).
              You can, for example, use it to encode CEPII GeoDist data or to standardize lists of languages for quantitative humanities research.

              Code:
              import excel "iso0639_classifications.xlsx", clear
              drop if _n > 489
              pyconvertu __dump, to("iso0639-2_all")
              clear
              pyconvertu __classification, from(iso0639-2_all.json) to(iso2) gen(iso2)
              pyconvertu __classification, from(iso0639-2_all.json) to(iso3) gen(iso3)
              pyconvertu __classification, from(iso0639-2_all.json) to(name_en) gen(name_en)
              I tested the RegEx on the Excel file itself; it worked well but may not yet be 100% bug-proof.
              Attached Files



              • #8
                I've compiled one more classification, ISO 4217 (currency codes), and re-tested the existing ones to remove bugs in the regular expressions.
                So far, after a standardization of sorts, the following classifications are available:
                • ISO 3166-1: two-letter, three-letter, and numeric country codes with names in English and French (just above 200 entries); this is the default classification that ships with the pyconvertu command on SSC.
                • ISO 4217 (April 2022): three-letter and numeric currency codes with names in English and French, and numeric units (just below 200 entries);
                • ISO 639-2: two- and three-letter language codes with names in English and French (just under 500 entries).
                After some consideration, I've decided to post the files here so that an interested enthusiast can extend the translation to other languages such as Mandarin, Hindi, Arabic, etc.
                Could you please post the result(s) here if you do so? Simply add columns following the logic of the enclosed Excel files. Your contribution is highly appreciated!

                iso3166_classifications.xlsx
                iso4217_classifications.xlsx
                iso0639_classifications.xlsx

                How do you construct usable JSON files for pyconvertu with the from() option? It's now straightforward:
                Code:
                        import excel <EXCEL FILE>, clear
                        pyconvertu __dump, to(<JSON FILE>)
                How does it all work?
                Using regular expressions, pyconvertu can convert the English names of countries, currencies, and languages into ISO codes and vice versa. It can also translate these names into the other available languages!
                For example,
                Code:
                       pyconvertu __classification, to(name_en) gen(country) // will create a list of English country names
                       pyconvertu country, to(iso3) gen(iso3) // will convert the country names into ISO 3166 alpha-3 codes
                       pyconvertu iso3, to(name_en) gen(name_en) // will reverse the last operation
                       pyconvertu name_en, to(name_fr) gen(name_fr) // will translate 200+ English country names into French just like that! Or to your language if you extend the existing Excel files.
                What is it good for?
                Merging cross-sectional or panel data of any kind. Suppose you have ISO country codes in one dataset (say, from the IMF) and country names in another (say, from UNCTAD).
                pyconvertu will solve the problem for you in one line of code.
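                To make that concrete, here is the merge problem in miniature, sketched in plain Python with made-up numbers (pyconvertu does the standardization step inside Stata; this only illustrates the idea, and the regex table is invented):
                ```python
                import re

                # Two toy tables with made-up values: one keyed by ISO alpha-3
                # codes (as in IMF data), the other by English country names
                # (as in UNCTAD data).
                by_code = {"CZE": 1.0, "USA": 2.0}
                by_name = {"Czech Republic": 3.0, "United States": 4.0}

                # Invented regex table standing in for the classification file.
                NAME_TO_ISO3 = {
                    r"^czech(ia| republic)?$": "CZE",
                    r"^united states( of america)?$": "USA",
                }

                def standardize(name):
                    """Convert an English country name to its alpha-3 code."""
                    for pattern, code in NAME_TO_ISO3.items():
                        if re.match(pattern, name.strip(), flags=re.IGNORECASE):
                            return code
                    return None

                # Once both tables share the same key, the merge is trivial.
                merged = {standardize(n): (by_code[standardize(n)], v)
                          for n, v in by_name.items()}
                ```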

