Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Unicode problem (?) when loading data from MS SQL Server database

    I use Stata to extract data from an MS SQL Server database. For example, I use this command to load a table with country names in Arabic, Chinese, English, French, Russian and Spanish:
    Code:
    odbc load, exec("select * from dbasisdess.country")
    In Stata 13, this loads the English, French and Spanish names correctly and with all accents (e.g. Afganistán and Algérie) but for the Arabic, Chinese and Russian country names only question marks are shown in Stata.
    Code:
    Afghanistan   Afghanistan   Afganistán   ??????????   ???     ?????????
    Algeria       Algérie       Argelia      ?????        ?????   ???????
    In Stata 14, the same command loads the Arabic, Chinese, English and Russian names correctly but in the French and Spanish names the accents have been replaced by a different character.
    Code:
    Afghanistan   Afghanistan   Afganist�n   Афганистан   阿富汗       أفغانستان
    Algeria       Alg�rie      Argelia       Алжир        阿尔及利亚   الجزائر
    I assume that this is a coding problem but the section "Unicode and ODBC" in [D] odbc contains very little information. I also don't know if version is supposed to have an effect (help version doesn't mention Unicode) but I get the same result in Stata 14 when the do-file starts with version 13.1 or version 14.0.

    How could I load all country names correctly in Stata 14?

  • #2
    I believe the problem you are seeing is caused by your table having country names that are a mix of extended ASCII (in the French and Spanish names) and Unicode (Russian and Chinese). The best way for us to resolve this is for Friedrichis to email [email protected].

    Comment


    • #3
      The English, French and Spanish country names are stored as varchar in the SQL Server database. The Arabic, Chinese and Russian country names are stored as nvarchar. The solution, provided by Stata tech support, was to use the unicode translate command.
      Code:
      odbc load, exec("select * from dbasisdess.country")
      save "Country names.dta"
      clear
      unicode analyze "Country names.dta"
      unicode encoding set "latin1"
      unicode translate "Country names.dta"
      All country names are now displayed correctly in Stata.
      Code:
      Afghanistan   Afghanistan   Afganistán   Афганистан   阿富汗       أفغانستان
      Algeria       Algérie       Argelia      Алжир        阿尔及利亚   الجزائر

      Comment


      • #4
        You can also explicitly CAST() the extended ASCII into Unicode on-the-fly in your query. This will deliver the extended ASCII to Stata as Unicode, which will display correctly with all of the accents etc. (You might not ultimately want those Stata variables in Unicode, but at least a legible transfer can be in one command.)

        I've attached a do-file and its SMCL log file that shows how to do this. The do-file and log-file first replicate your problem and then show that SELECT . . ., CAST(some_extended_ascii AS NVARCHAR) AS extended_ascii, . . . does allow for correct loading of the extended-ASCII column's text into Stata.*

        I've also attached the SQL DDL / populate-table file to show the table's data-types and what was put into the table. (I've had to append a .txt suffix in order for the forum software to accept it as an attachment.)

        Using MS SQL Server 2012 through SQL Native Client 11.0 on a Windows 8.1 development box.

        *My Windows code-page doesn't cover Władysław, and so that didn't get loaded into the database intact in the first place in the extended-ASCII column, but the ISO Latin 1 in the other rows of data did.
        Attached Files

        Comment


        • #5
          Thank you for this suggestion. I can load all country names correctly with the help of CAST() and don't have to use unicode translate.
          Code:
          odbc load, exec("select cast(CO_NAME_EN as nvarchar(200)) as CO_NAME_EN,
            cast(CO_NAME_FR as nvarchar(200)) as CO_NAME_FR,
            cast(CO_NAME_SP as nvarchar(200)) as CO_NAME_SP,
            CO_NAME_RU, CO_NAME_CH, CO_NAME_AR from dbasisdess.country")

          Comment

          Working...
          X