character encoding in text log files

Niels Henrik Bruun

Join Date: Aug 2014
Posts: 555

10 Feb 2016, 01:47

Hi
Does anyone know which character encoding is used in Stata text log files?

When I test a log file I get:

Code:

. unicode analyze mylog.log

  File summary (before starting):
        1  file(s) specified
        1  file(s) to be examined ...

  File mylog.log (text file)
               399 lines in file
               395 lines ASCII
                 4 lines UTF-8
          File does not need translation, except ...
          The file appears to be UTF-8 already.
          Sometimes files that still need translating can look like UTF-8.  
          See lines 186, 385, 386, and 387.  
          A total of 4 lines out of 399 appear to be UTF8.

So apparently it is a mix of ASCII and UTF-8. I would like to use -unicode translate- on my log file to make it all UTF-8, but to do so I need to know the original character encoding.
Or is it already UTF-8?

Kind regards

nhb


William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

10 Feb 2016, 09:41

Does this discussion from Monday shed any light on your question?

http://www.statalist.org/forums/foru...a-14?p=1326071
Niels Henrik Bruun

Join Date: Aug 2014

Posts: 555
#3

11 Feb 2016, 03:49

Hi William
Thank you for your comment. I was hoping more for an answer like one of these:
Yes, log files are UTF-8

No, the log files are ISO-8859-1

No, the character encoding depends on your computer settings

In the last case it would be nice if there were a tool that tells you the type of encoding you have in the logs, too.

Kind regards

nhb
Robert Picard

Join Date: Mar 2014

Posts: 1536
#4

11 Feb 2016, 08:35

With a few exceptions (e.g. filefilter), all of Stata 14 works in Unicode. All log files generated by Stata 14 are UTF-8 encoded.

It takes a single byte to store a plain ASCII character (values 0-127). In a UTF-8 encoded document, every character that is not plain ASCII is stored using two or more bytes, and each of those bytes is in the range 128-255. So an isolated byte in the 128-255 range is invalid, and not every combination of bytes in that range is allowed in UTF-8.
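The byte rules above can be tried out outside Stata. The following is only a rough Python sketch of a line-by-line check in the spirit of what unicode analyze reports; it is not Stata's actual algorithm, and the names classify_line and summarize and the sample data are made up for illustration.

```python
def classify_line(raw: bytes) -> str:
    """Label one line of raw bytes (hypothetical helper, not Stata's logic)."""
    # Plain ASCII: every byte is in the range 0-127.
    if all(b < 128 for b in raw):
        return "ASCII"
    # Otherwise the bytes above 127 must form valid UTF-8 sequences.
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "needs translation"
    return "UTF-8"


def summarize(data: bytes) -> dict:
    """Count how many lines fall into each category."""
    counts = {"ASCII": 0, "UTF-8": 0, "needs translation": 0}
    for line in data.splitlines():
        counts[classify_line(line)] += 1
    return counts


# Made-up sample: one ASCII line, one UTF-8 line, one ISO-8859-1 line.
sample = (
    b"plain line\n"
    + "caf\u00e9\n".encode("utf-8")      # e-acute stored as two bytes
    + "caf\u00e9\n".encode("latin-1")    # e-acute stored as one byte (0xE9)
)
print(summarize(sample))
```

The third sample line ends in a single byte in the 128-255 range, which is exactly the case described above as invalid UTF-8.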

Given the above rules, it's possible for a text file that uses an extended ASCII encoding to appear to be UTF-8 (e.g. two consecutive characters that each have a meaning in ISO-8859-1 but together also form a valid UTF-8 character). That's why the report quoted in #1 appears to be hedging. Most likely, a file with a different encoding would show something like:

Code:

. unicode analyze windows.txt

  File summary (before starting):
        1  file(s) specified
        1  file(s) to be examined ...

  File windows.txt (text file)
                  8 lines in file
                  7 lines ASCII
                  1 lines need translation
  --------------------------------------------------------------------------
          File needs translation.  Use unicode translate on this file.

  File windows.txt needs translation

  File summary:
        1  file(s) need translation
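To illustrate why a file in an extended ASCII encoding can still look like UTF-8, here is a small Python sketch (again outside Stata, with made-up sample strings and a hypothetical helper named looks_like_utf8). The two ISO-8859-1 characters Ã and © are stored as the bytes 0xC3 0xA9, which happen to be the valid UTF-8 encoding of é, whereas a lone ISO-8859-1 é (0xE9) is not valid UTF-8.

```python
def looks_like_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Two consecutive ISO-8859-1 characters (A-tilde, copyright sign): bytes C3 A9.
ambiguous = "\u00c3\u00a9".encode("iso-8859-1")
# A single ISO-8859-1 e-acute inside a word: byte E9 on its own.
unambiguous = "caf\u00e9".encode("iso-8859-1")

print(ambiguous, looks_like_utf8(ambiguous))      # passes the UTF-8 check
print(ambiguous.decode("utf-8"))                  # but reads as a different character
print(unambiguous, looks_like_utf8(unambiguous))  # fails the UTF-8 check
```

A byte-level check alone therefore cannot prove that a file was written as UTF-8; it can only show that the file is consistent with UTF-8.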
Niels Henrik Bruun

Join Date: Aug 2014

Posts: 555
#5

12 Feb 2016, 01:32

Hi Robert
This was what I needed to know.
Thank you both very much.

Kind regards

nhb