Sorting Japanese text in Stata 14

Friedrich Huebler

Join Date: Apr 2014

Posts: 1053
#1

08 Apr 2015, 11:22

Thank you for adding Unicode support to Stata. I am writing about the rules for sorting Japanese strings. I created a string variable "mystr" with a mix of hiragana, katakana and kanji words. With the commands below, I then changed the locale to Japanese, generated a sort key as explained in [U] 12.4.2.5 (Sorting strings containing Unicode characters), and then sorted the data by that key.

Code:

set locale_functions jpn
generate sortkeyjp = ustrsortkey(mystr, "jpn")
sort sortkeyjp

The result is that the strings are sorted in two groups: one group with hiragana and katakana strings, another group with kanji strings. The first group is broadly sorted as one would expect (あ, か, さ, etc.) but the order of hiragana and katakana with the same reading (e.g. フランス and ふらんす) appears to be random. Repeated sorting shows that sometimes フランス precedes ふらんす and sometimes it is the other way. The order of the second group, with kanji strings, appears to follow the gojūon sort order.

The sort key is not human-readable, as explained in the user manual, but the entries for フランス and ふらんす in the variable "sortkeyjp" look the same. Is the order of strings with the same reading indeed random and is there any way to control the sort order, for example to always list hiragana words before katakana words with the same reading?

As a footnote, [U] 12.4.2.5 contains a reference to "a list of valid collation keywords and their meanings" at http://unicode.org/repos/cldr/trunk/.../collation.xml. I am not sure how one should read this page. In Chrome and Firefox the page appears as follows (I am only showing the first few lines).

Code:

This XML file does not appear to have any style information associated with it. The document tree is shown below.

<ldmlBCP47>
  <version number="$Revision$"/>
  <generation date="$Date$"/>
  <keyword>
    <key name="co" description="Collation type key" alias="collation">
      <type name="big5han" description="Pinyin ordering for Latin, big5 charset ordering for CJK characters (used in Chinese)"/>
      <type name="compat" description="A previous version of the ordering, for compatibility" since="26"/>
      <type name="dict" description="Dictionary style ordering (such as in Sinhala)" alias="dictionary"/>
      <type name="direct" description="Binary code point order (used in Hindi)" deprecated="true"/>

In Internet Explorer the page starts with the lines below; the rest (starting with "Copyright") is the same as in Chrome and Firefox.

Code:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE ldmlBCP47 SYSTEM "../../common/dtd/ldmlBCP47.dtd">
Rebecca Pope (StataCorp)

StataCorp Employee

Join Date: Mar 2014

Posts: 6
#2

08 Apr 2015, 12:56

For the sake of those who may read this post without reading the User's Guide first, I want to reiterate something important that we said there: In most cases, you do not need to use anything other than Stata's standard sort command, even with strings that contain characters beyond the plain ASCII range. You can use by:, statsby:, merge, etc. without any of the following. In fact, for most Stata operations you are likely to be better off avoiding collation, the sorting of strings in a language-sensitive manner, altogether.

That said, here is an overview of what is causing the sorting pattern that Friedrich is seeing. Unicode sorting is a multilevel sort with different levels of comparison. The User's Guide discusses this briefly but sends you to the Functions Manual for a more complete explanation.

Briefly, differences in strings can be classified as primary, secondary, tertiary, or quaternary in strength. The strength for ustrsortkey() is tertiary, but the difference between Katakana and Hiragana is considered to be quaternary. That means that, from the perspective of ustrsortkey(), there is no difference between words with the same reading.
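
As a rough illustration of these strength levels, the sketch below uses ustrcompareex(), the comparison counterpart of ustrsortkeyex(), on Latin strings. The "en" locale and the example strings are assumptions chosen for illustration and are not from the original question. A result of 0 means the two strings are treated as equal at that strength; -1 in an argument requests the locale's default.

```stata
* Illustrative sketch; the strings and the "en" locale are assumptions.
* Arguments after the locale: st case cse norm num alt fr (-1 = default).

* Strength 1 (primary): only base letters count; accents and case are ignored.
display ustrcompareex("a", "á", "en", 1, -1, -1, -1, -1, -1, -1)

* Strength 2 (secondary): accents now count, case still does not.
display ustrcompareex("a", "á", "en", 2, -1, -1, -1, -1, -1, -1)
display ustrcompareex("a", "A", "en", 2, -1, -1, -1, -1, -1, -1)

* Strength 3 (tertiary): case differences count as well.
display ustrcompareex("a", "A", "en", 3, -1, -1, -1, -1, -1, -1)
```

The Katakana/Hiragana distinction sits one level further down, which is why a tertiary-strength key cannot see it.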

Using the example that Friedrich gives:

Code:

. set obs 2
number of observations (_N) was 0, now 2

. generate x = "フランス"

. replace x = "ふらんす" in 2
(1 real change made)

. generate skey1 = ustrsortkey(x,"jpn")

For illustration purposes and because we just have two observations, we use list with a logical test to prove that the sort keys are equal.

Code:

. list x if skey1[1] == skey1[2]

     +----------+
     |        x |
     |----------|
  1. | フランス |
  2. | ふらんす |
     +----------+

As noted, they are the same. To force a difference between Katakana and Hiragana, you must use the extended version of ustrsortkey(), namely ustrsortkeyex(), and set the strength to 4. This will cause Stata to look for quaternary differences.

Code:

. generate skey2 = ustrsortkeyex(x, "jpn", 4, -1, -1, -1, -1, -1, -1)

. list x if skey2[1] == skey2[2]

The list command displays nothing because the sort keys now distinguish between フランス and ふらんす. We can then sort by skey2.

Code:

. sort skey2

. list x

     +----------+
     |        x |
     |----------|
  1. | ふらんす |
  2. | フランス |
     +----------+
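
On the question of why tied observations changed places between runs: Stata's sort makes no promise about the relative order of observations with equal keys unless the stable option is specified, so ties on the tertiary-strength key can come out either way. If a quaternary key is not wanted, a possible alternative (a sketch under stated assumptions, not an official recommendation) is to break ties explicitly with the raw string. Plain sort compares strings byte by byte, and in UTF-8 the hiragana block (U+3040 to U+309F) precedes the katakana block (U+30A0 to U+30FF).

```stata
* Sketch: deterministic ordering of strings that tie on the tertiary key.
clear
set obs 2
generate x = "フランス"
replace x = "ふらんす" in 2
generate skey1 = ustrsortkey(x, "jpn")

* Option 1: keep tied observations in their current relative order.
sort skey1, stable

* Option 2: break ties by the bytes of x itself.
* For words written entirely in one script, hiragana then precedes katakana.
sort skey1 x
list x
```

The second option only gives a tidy hiragana-before-katakana rule when each word is written in a single script; words that mix the two scripts are ordered by their first differing byte.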

I hope this is helpful.

Friedrich Huebler

Join Date: Apr 2014

Posts: 1053
#3

10 Apr 2015, 13:03

Thank you for the helpful answer. Is it possible to sort a mix of hiragana, katakana and kanji expressions by their reading (あ, か, さ...)? I could not find an answer to this question in the manuals.

As an example, take a list of country names.

Code:

input str14 countryjp str14 countryen sort
"イタリア" "Italy" 1
"中国" "China" 2
"トーゴ" "Togo" 3
"日本" "Japan" 4
"フランス" "France" 5
"南スーダン" "South Sudan" 6
"メキシコ" "Mexico" 7
end

In Japanese, this list should be sorted as indicated by the values in the variable "sort". (Please note that there is a problem with the Japanese name of South Sudan. As you can see below, the ン at the end, which is displayed correctly in the Do-File Editor, is replaced by another character in the Data Editor and in the Results window. The character ン also appears in the Japanese name of France but is shown correctly in the Data Editor and the Results window. Could you explain this problem? I use Stata with Windows 7 Professional SP1.)

Code:

. list, sep(0) noobs

  +--------------------------------+
  | countryjp     countryen   sort |
  |--------------------------------|
  |  イタリア         Italy      1 |
  |      中国         China      2 |
  |    トーゴ          Togo      3 |
  |      日本         Japan      4 |
  |  フランス        France      5 |
  | 南スーダ�   South Sudan      6 |
  |  メキシコ        Mexico      7 |
  +--------------------------------+

Sorting in the following manner lists katakana country names first, followed by kanji country names. The latter are not in the proper order because South Sudan should be listed after Japan.

Code:

. gen sortkey = ustrsortkey(countryjp, "jpn")
. sort sortkey
. list countryjp countryen sort, sep(0) noobs

  +--------------------------------+
  | countryjp     countryen   sort |
  |--------------------------------|
  |  イタリア         Italy      1 |
  |    トーゴ          Togo      3 |
  |  フランス        France      5 |
  |  メキシコ        Mexico      7 |
  |      中国         China      2 |
  | 南スーダ�   South Sudan      6 |
  |      日本         Japan      4 |
  +--------------------------------+

The command that you showed for proper sorting of hiragana and katakana yields the same result.

Code:

. gen sortkey2 = ustrsortkeyex(countryjp, "jpn", 4, -1, -1, -1, -1, -1, -1)
. sort sortkey2
. list countryjp countryen sort, sep(0) noobs

  +--------------------------------+
  | countryjp     countryen   sort |
  |--------------------------------|
  |  イタリア         Italy      1 |
  |    トーゴ          Togo      3 |
  |  フランス        France      5 |
  |  メキシコ        Mexico      7 |
  |      中国         China      2 |
  | 南スーダ�   South Sudan      6 |
  |      日本         Japan      4 |
  +--------------------------------+

In my example I can get the desired sort order through the "sort" variable. Is it possible to sort the data the same way by only evaluating the Japanese strings, in this case the variable "countryjp"? I am aware of the difficulties caused by the fact that kanji have more than one reading but I am trying to understand how Stata processes Japanese text and more generally Unicode.


Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 344
#4

10 Apr 2015, 14:24

The variable is entered with

Code:

input str14 countryjp

which means that Stata takes the first 14 bytes of the input string. For example,

Code:

input str2 x
abcd
end
list x

will show that x contains "ab". In this case, however, "南スーダン" needs 15 bytes in UTF-8 (5 characters at 3 bytes each), so str14 cuts off the last byte of the last character and produces an invalid UTF-8 sequence. See Example 4 in [D] input for an explanation of what to do with str# when your strings contain Unicode.
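
A quick way to check the required width, as a minimal sketch: strlen() counts bytes, ustrlen() counts Unicode characters, and ustrinvalidcnt() reports the number of invalid UTF-8 sequences in a string. The width str30 below is an arbitrary roomy choice made for this example, not a recommended value.

```stata
* Sketch: check how many bytes a Unicode string needs before choosing str#.
* Bytes (15 for this string):
display strlen("南スーダン")
* Unicode characters (5 for this string):
display ustrlen("南スーダン")

* Re-enter the variable with enough room (str30 is an arbitrary safe width).
clear
input str30 countryjp
"南スーダン"
end

* A count of 0 means no character was cut off mid-sequence.
display ustrinvalidcnt(countryjp[1])
```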

Sorting Japanese words/phrases from a mixture of Japanese alphabets/writing systems is something a human can do, but is truly difficult for a computer to perform.

Stata uses the ICU implementation of the Unicode Collation Algorithm (UCA). There are some FAQs on the unicode.org site which relate to this. For example, search for 'Hiragana' within the page linked to in the previous sentence.

You may also find the following blog post interesting. It is not about Stata, but it discusses the difficulties in sorting a mixture of Japanese writing systems, particularly with Kanji:

http://www.localizingjapan.com/blog/...olved-problem/
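
Because the reading of kanji cannot be derived reliably from the characters alone, one workaround is to store the reading yourself and collate on that. The sketch below adds a hypothetical helper variable, yomi, holding hiragana readings typed in by hand for the country list from this thread; Stata does not generate these readings, and for names with more than one reading (日本 can be read にほん or にっぽん) the choice is up to the user.

```stata
* Sketch: sort by a manually supplied reading (yomi is a hypothetical helper).
* str# widths are in bytes; each kana or kanji here needs 3 bytes in UTF-8.
clear
input str30 countryjp str14 countryen str30 yomi
"中国"       "China"       "ちゅうごく"
"フランス"   "France"      "ふらんす"
"イタリア"   "Italy"       "いたりあ"
"日本"       "Japan"       "にほん"
"メキシコ"   "Mexico"      "めきしこ"
"南スーダン" "South Sudan" "みなみすーだん"
"トーゴ"     "Togo"        "とーご"
end

* Collate on the reading rather than on the mixed-script name.
generate yomikey = ustrsortkey(yomi, "jpn")
sort yomikey
list countryjp countryen yomi, sep(0) noobs
```

With these readings the list should come out in gojūon order (い, ち, と, に, ふ, み, め), which matches the order given in the country example above.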

Last edited by Hua Peng (StataCorp); 10 Apr 2015, 14:29.
Friedrich Huebler

Join Date: Apr 2014

Posts: 1053
#5

13 Apr 2015, 11:31

Thank you for the additional information and thank you for the reminder that many entries in the PDF manuals now contain references to Unicode.
