  • character encoding in text log files

    Hi
    Does anyone know which character encoding is used in Stata text log files?

    When I test a log file I get:
    Code:
    . unicode analyze mylog.log
    
      File summary (before starting):
            1  file(s) specified
            1  file(s) to be examined ...
    
      File mylog.log (text file)
                   399 lines in file
                   395 lines ASCII
                     4 lines UTF-8
              File does not need translation, except ...
              The file appears to be UTF-8 already.
              Sometimes files that still need translating can look like UTF-8.  
              See lines 186, 385, 386, and 387.  
              A total of 4 lines out of 399 appear to be UTF8.
    So apparently it is a mix of ASCII and UTF-8. I would like to use -unicode translate- on my log file to make it all UTF-8, but to do so I need to know the original character encoding.
    Or is it already UTF-8?
    Kind regards

    nhb

  • #2
    Does this discussion from Monday shed any light on your question?

    http://www.statalist.org/forums/foru...a-14?p=1326071


    • #3
      Hi William
      Thank you for your comment. I was hoping more for something like:
      • Yes, log files are UTF-8
      • No, the log files are ISO-8859-1
      • No, the character encoding depends on your computer settings
      In the last case it would also be nice to have a tool that tells you which encoding your logs use.
      Kind regards

      nhb


      • #4
        With a few exceptions (e.g. filefilter), all of Stata 14 works in Unicode. All log files generated by Stata 14 are UTF-8 encoded.

        It takes a single byte to store a plain ASCII character (values 0-127). In a UTF-8 encoded document, every character that is not plain ASCII is stored using two or more bytes, and each of those bytes is in the range 128-255. So an isolated byte in the 128-255 range is invalid, and not all combinations of 128-255 bytes are allowed in UTF-8.
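
The byte rules in the paragraph above can be sketched in a few lines of Python. This is only an illustration of the rules, not how Stata's unicode analyze is implemented; the function name classify_line and the sample strings are made up for the example.

```python
def classify_line(raw: bytes) -> str:
    """Label one line of raw bytes: ASCII, UTF-8, or needs translation."""
    # Plain ASCII: every byte is in the range 0-127.
    if all(b < 128 for b in raw):
        return "ASCII"
    try:
        # Strict decoding rejects isolated or badly combined 128-255 bytes.
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "needs translation"
    return "UTF-8"


# Hypothetical sample lines, for illustration only.
samples = {
    "plain": b"summarize price",
    "utf8": "caf\u00e9".encode("utf-8"),      # e-acute stored as two bytes: 0xC3 0xA9
    "latin1": "caf\u00e9".encode("latin-1"),  # e-acute stored as one byte: 0xE9
}
for name, raw in samples.items():
    print(name, classify_line(raw))
```

The ISO-8859-1 sample is flagged because its single 0xE9 byte is not followed by the continuation bytes that UTF-8 requires.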

        Given the above rules, it's possible for a text file that uses an extended ASCII encoding to appear to be UTF-8 (e.g. two consecutive characters that each have a meaning in ISO-8859-1 but together form a valid UTF-8 character). That's why the report quoted in #1 appears to be hedging. Most likely, a file with a different encoding would show something like:

        Code:
        . unicode analyze windows.txt
        
          File summary (before starting):
                1  file(s) specified
                1  file(s) to be examined ...
        
          File windows.txt (text file)
                         8 lines in file
                         7 lines ASCII
                         1 lines need translation
                  --------------------------------------------------------------------------------------------------------------------------
                  File needs translation.  Use unicode translate on this file.
        
          File windows.txt needs translation
        
          File summary:
                1 file(s) need translation
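
To make the ambiguity concrete, here is a small Python sketch of the case where ISO-8859-1 text happens to pass as UTF-8. The helper name is_valid_utf8 and the chosen characters are assumptions for the example; nothing here is taken from Stata itself.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Two ISO-8859-1 characters, A-tilde (0xC3) followed by the copyright sign (0xA9).
ambiguous = "\u00c3\u00a9".encode("latin-1")
# A single ISO-8859-1 e-acute (0xE9), the more typical case.
typical = "\u00e9".encode("latin-1")

# The two-byte pair is also the UTF-8 encoding of e-acute, so it passes.
print(is_valid_utf8(ambiguous), ambiguous.decode("utf-8") == "\u00e9")
# The isolated 0xE9 byte cannot be valid UTF-8, so it fails.
print(is_valid_utf8(typical))
```

A validity check alone therefore cannot prove that a file was written as UTF-8, which fits the cautious wording of the report in #1. Accidental matches like the first one are uncommon in ordinary text.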


        • #5
          Hi Robert
          This is what I needed to know.
          Thank you both very much.
          Kind regards

          nhb
