Thank you for adding Unicode support to Stata. I am writing about the rules for sorting Japanese strings. I created a string variable "mystr" with a mix of hiragana, katakana and kanji words. With the commands below, I then changed the locale to Japanese, generated a sort key as explained in [U] 12.4.2.5 (Sorting strings containing Unicode characters), and sorted the data by that key.
The result is that the strings are sorted in two groups: one group with hiragana and katakana strings, another group with kanji strings. The first group is broadly sorted as one would expect (あ, か, さ, etc.), but the order of hiragana and katakana strings with the same reading (e.g. フランス and ふらんす) appears to be random. Repeated sorting shows that sometimes フランス precedes ふらんす and sometimes it is the other way around. The second group, with kanji strings, appears to follow the gojūon sort order.
The sort key is not human-readable, as explained in the user manual, but the entries for フランス and ふらんす in the variable "sortkeyjp" look the same. Is the order of strings with the same reading indeed random and is there any way to control the sort order, for example to always list hiragana words before katakana words with the same reading?
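One workaround I can think of, as a sketch only: if the two sort keys really are identical, sort has to break the tie somehow, and adding the original string as a second sort variable should make the result reproducible. Because hiragana have lower Unicode code points than katakana, a plain comparison of the strings should put ふらんす before フランス. The second variant assumes that the strength argument of ustrsortkeyex() (5 = identical) makes the keys themselves differ; I have not verified this for the Japanese locale, and the variable name sortkeyjp5 is just an illustration.

```stata
* Variant 1: break ties between identical sort keys by the raw string.
* Assumes mystr and sortkeyjp exist as created by the commands in this post.
sort sortkeyjp mystr

* Variant 2 (untested assumption): request a stricter comparison strength
* so that hiragana and katakana spellings get different keys.
* Arguments: string, locale, strength, then -1 (locale default) for the rest.
generate sortkeyjp5 = ustrsortkeyex(mystr, "jpn", 5, -1, -1, -1, -1, -1, -1)
sort sortkeyjp5
```

With variant 1 the grouping by reading still comes from the sort key; only the order within ties is fixed by the second variable.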
As a footnote, [U] 12.4.2.5 contains a reference to "a list of valid collation keywords and their meanings" at http://unicode.org/repos/cldr/trunk/.../collation.xml. I am not sure how one should read this page. In Chrome and Firefox the page appears as follows (I am only showing the first few lines).
In Internet Explorer the page starts with the lines below; the rest (starting with "Copyright") is the same as in Chrome and Firefox.
Code:
set locale_functions jpn
generate sortkeyjp = ustrsortkey(mystr, "jpn")
sort sortkeyjp
Code:
This XML file does not appear to have any style information associated with it. The document tree is shown below.
<!--
Copyright © 1991-2014 Unicode, Inc.
CLDR data files are interpreted according to the LDML specification (http://unicode.org/reports/tr35/)
For terms of use, see http://www.unicode.org/copyright.html
-->
<ldmlBCP47>
  <version number="$Revision$"/>
  <generation date="$Date$"/>
  <keyword>
    <key name="co" description="Collation type key" alias="collation">
      <type name="big5han" description="Pinyin ordering for Latin, big5 charset ordering for CJK characters (used in Chinese)"/>
      <type name="compat" description="A previous version of the ordering, for compatibility" since="26"/>
      <type name="dict" description="Dictionary style ordering (such as in Sinhala)" alias="dictionary"/>
      <type name="direct" description="Binary code point order (used in Hindi)" deprecated="true"/>
Code:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE ldmlBCP47 SYSTEM "../../common/dtd/ldmlBCP47.dtd">