  • Stata 14, special characters and the import command

    I was happy to see that Unicode was implemented in Stata 14 (I am using Stata 14.0, current update level 10 Jun 2015). However, I am experiencing problems with the import command. I have some Excel files that I want to import into Stata using the options firstrow and case(lower), i.e.:
    Code:
    import excel "file.xlsx", firstrow case(lower) clear
    In some of these files the first row contains Swedish special characters (ÅÄÖ), for example "ÅR" ("YEAR" in English) and "KÖN" ("GENDER" in English). Since I am using the case(lower) option, the variable names should be "år" for "ÅR" and "kön" for "KÖN", but this does not work -- Stata imports the variables as "År" and "kÖn". Hence, it seems that the lowercase option does not work for (at least Swedish) special characters.

    To me, this clearly seems like a bug, and if so, what is the best way to report it? It seems a bit weird to register as a Stata user since I am a grad student and am using the department license (whatever you may call it).

  • #2
    You can email [email protected] citing this post. Or wait for someone from StataCorp to notice.

    • #3
      I could reproduce the problem. Furthermore,
      Code:
      rename _all , lower
      did not change the case of Ö and Å (nor of Æ and Ø). So apparently Stata does not "know" the relationship between lowercase ö and uppercase Ö, or, perhaps more generally, does not know this for the extended ASCII characters.

      • #4
        Converting the case of Unicode letters requires locale information, since the same letter might be converted differently in different languages. For example, the lowercase of "I" is "i" in English but a dotless "i" in Turkish; see -help f_ustrupper- for details. Because of this extra locale requirement, we decided to leave the -case- option in -import excel- and the -lower- option in -rename- untouched, i.e., only ASCII letters are converted. We will introduce new options to handle Unicode case conversions in future Stata updates.
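
        As a small illustration of the locale dependence described above, the following sketch calls ustrlower() on the same strings under different locales. The locale codes "en", "tr" and "sv" are assumptions chosen for the example, and the exact output may depend on the Stata version and update level.

        ```stata
        * Same input, different locales: the result for "I" depends on the locale.
        display ustrlower("I", "en")
        display ustrlower("I", "tr")

        * The Swedish names from the original question; "sv" is an assumed locale code.
        display ustrlower("ÅR", "sv")
        display ustrlower("KÖN", "sv")

        * For comparison, strlower() changes only the ASCII letters.
        display strlower("KÖN")
        ```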

        • #5
          I see. Thanks!

          • #6
            I assume that the case(lower) option is equivalent to the strlower() function. From the Stata Functions Reference Manual:

            strlower(s)
            Description: lowercase ASCII characters in string s
            Unicode characters beyond the plain ASCII range are ignored.
            strlower("THIS") = "this"
            strlower("CAFÉ") = "cafÉ"
            Domain s: strings
            Range: strings with lowercased characters
            As an alternative you could import the data as is and then convert the strings with the ustrlower() function.

            ustrlower(s[,loc])
            Description: lowercase all characters of Unicode string s under the given locale loc
            If loc is not specified, the default locale is used. The same s but different loc may produce different results; for example, the lowercase letter of “I” is “i” in English but a dotless “i” in Turkish. The same Unicode character can be mapped to different Unicode characters based on its surrounding characters; for example, Greek capital letter sigma Σ has two lowercases: ς, if it is the final character of a word, or σ. The result can be longer or shorter than the input Unicode string in bytes.
            ustrlower("MÈDIANE","fr") = "médiane"
            ustrlower("ISTANBUL","tr") = "ıstanbul"
            ustrlower("ὈΔΥΣΣΕΎΣ") = "ὀδυσσεύς"
            Domain s: Unicode strings
            Domain loc: locale name
            Range: Unicode strings
            There is a typo in the manual that's not easily visible in the text above. Instead of ustrlower("MÈDIANE","fr") with accent grave it should say ustrlower("MÉDIANE","fr") with accent aigu.
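
            The alternative mentioned above (import the data as is, then convert with ustrlower()) can also be applied to the variable names themselves. This is only a sketch, not an official workaround: the file name is taken from the original question, the locale "sv" is an assumption, and the loop does not check whether a lowercased name collides with an existing variable.

            ```stata
            * Import without case(), so the variable names keep the case used in the sheet.
            import excel "file.xlsx", firstrow clear

            * Lowercase every variable name with the Unicode-aware function.
            * "sv" is an assumed locale; drop the second argument to use the default locale.
            foreach v of varlist _all {
                local new = ustrlower("`v'", "sv")
                * Rename only when the name actually changes; rename fails on an identical name.
                if "`new'" != "`v'" {
                    rename `v' `new'
                }
            }
            ```

            Running -describe- afterwards should show whether the names came out as intended (for example "år" and "kön").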

            • #7
              Thanks for finding this. We will get it fixed in a future Stata update. By the way, the help file is correct and the PDF manual has the typo.

              • #8
                Update: This problem is apparently fixed now (for Swedish characters, at least).
