Stata 14, special characters and the import command

Mattias Öhman

Join Date: Jun 2015

Posts: 6
#1


12 Jun 2015, 03:53

I was happy to see that Unicode was implemented in Stata 14 (I am using Stata 14.0, current update level 10 Jun 2015). However, I am experiencing problems with the import command. I have some Excel files that I want to import into Stata using the options firstrow and case(lower), i.e.:

Code:

import excel "file.xlsx", firstrow case(lower) clear

In some of these files the first row has Swedish special characters (ÅÄÖ), for example "ÅR" ("YEAR" in English) and "KÖN" ("GENDER" in English). Since I am using the case(lower) option, the variable names should be "år" for "ÅR" and "kön" for "KÖN", but this does not work: Stata imports the variables as "År" and "kÖn". Hence, it seems that the lower case option does not work for (at least Swedish) special characters.

To me, this clearly seems like a bug, and if so, what is the best way to report it? It seems a bit weird to register as a Stata user since I am a grad student and am using the department license (whatever you may call it).
Tags: import
Nick Cox

Join Date: Mar 2014

Posts: 35698
#2

12 Jun 2015, 04:57

You can email Stata technical support, citing this post. Or wait for someone from StataCorp to notice.
Svend Juul

Join Date: Apr 2014

Posts: 515
#3

12 Jun 2015, 06:20

I could reproduce the problem. Furthermore,

Code:

rename _all , lower

did not change the case of Ö and Å (nor of Æ and Ø). So apparently Stata does not "know" the relationship between lowercase ö and uppercase Ö, or, perhaps more generally, does not know this for the extended ASCII characters.
Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 346
#4

12 Jun 2015, 13:43

Converting the case of Unicode letters requires locale information, since the same letter may be converted differently in different languages. For example, the lowercase of "I" is "i" in English but a dotless "ı" in Turkish; see -help f_ustrupper- for details. Because of this extra locale requirement, we decided to leave the -case- option in -import excel- and the -lower- option in -rename- untouched, i.e., only ASCII letters are converted. We will introduce new options to handle Unicode case conversions in future Stata updates.
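
To illustrate the distinction described above, here is a minimal sketch using only the strlower() and ustrlower() functions mentioned in this thread. The locale codes "sv", "en" and "tr" are chosen for illustration and are not taken from the original post.

```stata
* ASCII-only conversion: letters outside the ASCII range are left untouched
display strlower("KÖN")

* Locale-aware conversion (locale "sv" is an illustrative choice)
display ustrlower("KÖN", "sv")

* Same input, different locale: the result depends on the language
display ustrlower("I", "en")
display ustrlower("I", "tr")
```

The first call shows the behavior reported earlier in the thread, while the ustrlower() calls show how the locale argument changes the outcome.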
Mattias Öhman

Join Date: Jun 2015

Posts: 6
#5

12 Jun 2015, 13:55

I see. Thanks!
Friedrich Huebler

Join Date: Apr 2014

Posts: 1053
#6

12 Jun 2015, 13:58

I assume that the case(lower) option is equivalent to the strlower() function. From the Stata Functions Reference Manual:

strlower(s)
Description: lowercase ASCII characters in string s
Unicode characters beyond the plain ASCII range are ignored.
strlower("THIS") = "this"
strlower("CAFÉ") = "cafÉ"
Domain s: strings
Range: strings with lowercased characters

As an alternative you could import the data as is and then convert the strings with the ustrlower() function.

ustrlower(s[,loc])
Description: lowercase all characters of Unicode string s under the given locale loc
If loc is not specified, the default locale is used. The same s but different loc may produce different results; for example, the lowercase letter of “I” is “i” in English but a dotless “i” in Turkish. The same Unicode character can be mapped to different Unicode characters based on its surrounding characters; for example, Greek capital letter sigma Σ has two lowercases: ς, if it is the final character of a word, or σ. The result can be longer or shorter than the input Unicode string in bytes.
ustrlower("MÈDIANE","fr") = "médiane"
ustrlower("ISTANBUL","tr") = "ıstanbul"
ustrlower("ὈΔΥΣΣΕΎΣ") = "ὀδυσσεύς"
Domain s: Unicode strings
Domain loc: locale name
Range: Unicode strings

There is a typo in the manual that's not easily visible in the text above. Instead of ustrlower("MÈDIANE","fr") with accent grave it should say ustrlower("MÉDIANE","fr") with accent aigu.
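
The workaround suggested above (bring the data in as is, then lowercase with ustrlower()) might look roughly like the sketch below when applied to variable names. To keep it self-contained it creates two toy variables instead of reading an Excel file. The variable names, the values and the "sv" locale are assumptions for illustration only; with real data the first three commands would be replaced by import excel without the case() option.

```stata
* Toy data standing in for an imported sheet (names are illustrative)
clear
set obs 1
generate ÅR = 2015
generate KÖN = 1

* Lowercase every variable name with the locale-aware function
foreach v of varlist _all {
    local new = ustrlower("`v'", "sv")
    * rename only when the name actually changes
    if "`new'" != "`v'" {
        rename `v' `new'
    }
}

describe
```

The check before rename is there because some names may already be entirely lowercase.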
Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 346
#7

12 Jun 2015, 15:28

Thanks for finding this. We will get it fixed in a future Stata update. By the way, the help file is correct and the PDF manual has the typo.
Mattias Öhman

Join Date: Jun 2015

Posts: 6
#8

12 Sep 2015, 01:14

Update: This problem is apparently fixed now (for Swedish characters, at least).