Unicode problem (?) when loading data from MS SQL Server database

Friedrich Huebler

Join Date: Apr 2014

Posts: 1053
#1

Unicode problem (?) when loading data from MS SQL Server database

20 Apr 2015, 09:57

I use Stata to extract data from an MS SQL Server database. For example, I use this command to load a table with country names in Arabic, Chinese, English, French, Russian and Spanish:

Code:

odbc load, exec("select * from dbasisdess.country")

In Stata 13, this loads the English, French and Spanish names correctly and with all accents (e.g. Afganistán and Algérie) but for the Arabic, Chinese and Russian country names only question marks are shown in Stata.

Code:

Afghanistan Afghanistan Afganistán ?????????? ??? ????????? Algeria Algérie Argelia ????? ????? ???????

In Stata 14, the same command loads the Arabic, Chinese, English and Russian names correctly but in the French and Spanish names the accents have been replaced by a different character.

Code:

Afghanistan Afghanistan Afganist�n Афганистан 阿富汗 أفغانستان Algeria Alg�rie Argelia Алжир 阿尔及利亚 الجزائر

I assume that this is a coding problem but the section "Unicode and ODBC" in [D] odbc contains very little information. I also don't know if version is supposed to have an effect (help version doesn't mention Unicode) but I get the same result in Stata 14 when the do-file starts with version 13.1 or version 14.0.

How could I load all country names correctly in Stata 14?
Tags: None
Kevin Crow (StataCorp)

StataCorp Employee

Join Date: Apr 2014

Posts: 64
#2

20 Apr 2015, 13:05

I believe the problem you are seeing is caused by your table having country names that are a mix of extended ASCII (in the French and Spanish names) and Unicode (Russian and Chinese). The best way for us to resolve this is for Friedrichis to email [email protected].
Comment
Friedrich Huebler

Join Date: Apr 2014

Posts: 1053
#3

23 Apr 2015, 12:24

The English, French and Spanish country names are stored as varchar in the SQL Server database. The Arabic, Chinese and Russian country names are stored as nvarchar. The solution, provided by Stata tech support, was to use the unicode translate command.

Code:

odbc load, exec("select * from dbasisdess.country") save "Country names.dta" clear unicode analyze "Country names.dta" unicode encoding set "latin1" unicode translate "Country names.dta"

All country names are now displayed correctly in Stata.

Code:

Afghanistan Afghanistan Afganistán Афганистан 阿富汗 أفغانستان Algeria Algérie Argelia Алжир 阿尔及利亚 الجزائر
Comment
Joseph Coveney

Join Date: Apr 2014

Posts: 4374
#4

26 Apr 2015, 01:50

You can also explicitly CAST() the extended ASCII into Unicode on-the-fly in your query. This will deliver the extended ASCII to Stata as Unicode, which will display correctly with all of the accents etc. (You might not ultimately want those Stata variables in Unicode, but at least a legible transfer can be in one command.)

I've attached a do-file and its SMCL log file that shows how to do this. The do-file and log-file first replicate your problem and then show that SELECT . . ., CAST(some_extended_ascii AS NVARCHAR) AS extended_ascii, . . . does allow for correct loading of the extended-ASCII column's text into Stata.*

I've also attached the SQL DDL / populate-table file to show the table's data-types and what was put into the table. (I've had to append a .txt suffix in order for the forum software to accept it as an attachment.)

Using MS SQL Server 2012 through SQL Native Client 11.0 on a Windows 8.1 development box.

*My Windows code-page doesn't cover Władysław, and so that didn't get loaded into the database intact in the first place in the extended-ASCII column, but the ISO Latin 1 in the other rows of data did.
Attached Files

TextCheck.do (625 Bytes, 1 view)

TextCheck.smcl (3.4 KB, 1 view)

crTextCheck.SQL.txt (3.7 KB, 1 view)
1 like
Comment

Friedrich Huebler

Join Date: Apr 2014
Posts: 1053

27 Apr 2015, 11:11

Thank you for this suggestion. I can load all country names correctly with the help of CAST() and don't have to use unicode translate.

Code:

odbc load, exec("select cast(CO_NAME_EN as nvarchar(200)) as CO_NAME_EN,
  cast(CO_NAME_FR as nvarchar(200)) as CO_NAME_FR,
  cast(CO_NAME_SP as nvarchar(200)) as CO_NAME_SP,
  CO_NAME_RU, CO_NAME_CH, CO_NAME_AR from dbasisdess.country")

Announcement

Unicode problem (?) when loading data from MS SQL Server database

Comment

Comment

Comment

Comment