Thanks to Kit Baum, the chartab package is now available on SSC. To install, type in Stata's Command window:
Code:
ssc install chartab
This installs two commands that tabulate character frequency counts. The chartab command tabulates Unicode characters (requires Stata 14 or higher) and the chartabb command tabulates byte codes (requires Stata 10 or higher).
If you are using an older version of Stata (version 13 or earlier), a character is encoded using a single byte. This allows for 256 distinct values. char(0) to char(127) are ASCII codes but there is no standard for what char(128) to char(255) represent.
If you are using Stata 14 or higher, each character is encoded in UTF-8. This is a storage-efficient Unicode encoding where the 128 ASCII characters are encoded using a single byte (using the same ASCII byte code). All other Unicode characters are encoded using a multi-byte sequence (from two to four bytes, with each byte code >= 128). So by design, UTF-8 is backwards compatible with ASCII.
Both chartab and chartabb can process text from any combination of string variables, files, string scalars, and string literals in a single run. Here's an example with a string literal:
Code:
. chartab , literal("j'ai hâte à l'été")

   decimal  hexadecimal   character |     frequency    unique name
------------------------------------+------------------------------------------------------
        32       \u0020             |             3    SPACE
        39       \u0027       '     |             2    APOSTROPHE
        97       \u0061       a     |             1    LATIN SMALL LETTER A
       101       \u0065       e     |             1    LATIN SMALL LETTER E
       104       \u0068       h     |             1    LATIN SMALL LETTER H
       105       \u0069       i     |             1    LATIN SMALL LETTER I
       106       \u006a       j     |             1    LATIN SMALL LETTER J
       108       \u006c       l     |             1    LATIN SMALL LETTER L
       116       \u0074       t     |             2    LATIN SMALL LETTER T
       224       \u00e0       à     |             1    LATIN SMALL LETTER A WITH GRAVE
       226       \u00e2       â     |             1    LATIN SMALL LETTER A WITH CIRCUMFLEX
       233       \u00e9       é     |             2    LATIN SMALL LETTER E WITH ACUTE
------------------------------------+------------------------------------------------------

                                      freq. count   distinct
ASCII characters              =                13          9
Multibyte UTF-8 characters    =                 4          3
Unicode replacement character =                 0          0
Total Unicode characters      =                17         12
I can do the same in Stata 10 using chartabb. But since this is an older version of Stata, each character is encoded using a single byte code. I'm on a Mac, so characters are encoded using the Mac OS Roman encoding.
Code:
. chartabb , literal("j'ai hâte à l'été")

   decimal  hexadecimal   character |     frequency
------------------------------------+--------------------------------------------------------------------
        32           20             |             3
        39           27       '     |             2
        97           61       a     |             1
       101           65       e     |             1
       104           68       h     |             1
       105           69       i     |             1
       106           6A       j     |             1
       108           6C       l     |             1
       116           74       t     |             2
       136           88       à     |             1
       137           89       â     |             1
       142           8E       é     |             2
------------------------------------+--------------------------------------------------------------------

ASCII control characters      =                 0
ASCII printable characters    =                13
Extended characters           =                 4
Total characters (bytes)      =                17
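The difference between characters and bytes can also be checked directly with Stata's built-in string functions. What follows is only a minimal sketch, assuming Stata 14 or higher and assuming the accented letters are entered as single precomposed characters (as in the chartab output above). The scalar name txt is just a placeholder I picked for illustration.

```stata
* requires Stata 14 or higher; txt is a hypothetical scalar name
scalar txt = "j'ai hâte à l'été"

* number of Unicode characters: 17
display ustrlen(scalar(txt))

* number of bytes in UTF-8: 13 ASCII bytes + 4 two-byte characters = 21
display strlen(scalar(txt))

* byte codes of one accented character (two bytes, each >= 128)
display tobytes("é")
```

The byte count is 21 rather than 17 because â, à, and é (twice) each take two bytes in UTF-8. In a single-byte encoding such as Mac OS Roman the same text occupies 17 bytes, which matches the chartabb total.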