  • -import delimited- and handling utf-8 encoding


    Can one determine the proper encoding to specify for a text file when using -import delimited-?

    I've increasingly encountered CSV files from various sources that import with stray upper-ASCII characters in variable names and labels because (I learned) the files had UTF-8 encoding, which I mistakenly imported with the default latin1 encoding (Stata version 15.1). While this problem is easy enough to fix after the fact, is there a way to determine the proper encoding other than having external knowledge of how the file was encoded and specifying it with the encoding() option? I see that newer versions of Microsoft Excel offer UTF-8 encoding of CSV files as an option, which I guess accounts for this issue becoming more frequent.
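
    For illustration, a minimal sketch (with a hypothetical file name) of what specifying the encoding up front looks like, versus taking the default:

        * file known (or suspected) to be UTF-8 encoded
        import delimited using "survey.csv", encoding("utf-8") clear

        * default behavior, which reads the file as latin1
        import delimited using "survey.csv", clear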

    (While there have been other Statalist threads in this general area, I didn't find one that narrowed the issue down to handling the encoding difference before it bites.)



  • #2
    Mike, are you specifying the encoding every time, or letting Stata make an educated guess?

    On a related note, I have found that a good-quality, programmer-friendly text editor (like Sublime Text or Notepad++) allows me to open files under a specific encoding, in case I wish to see which one will (not) work. On occasion, I've resaved text files with a different encoding.



    • #3
      Thanks, Leonardo. I'm just letting Stata "do its thing," which the documentation seems to suggest is simply to default to "latin1," so I don't think that Stata, at least in my version, investigates the file. Yes, I did discover that TextPad will let me open the file as UTF-8 and check it out.

      I had the thought that perhaps there was some easy way to discover the encoding without going outside Stata, more out of curiosity than anything else. From my experience here, a rather "kludgy" approach would be to read the file with -fileread()-, look at the first three bytes to see whether they are hex EF BB BF, which seems to fit all the UTF-8 CSV files I've looked at so far, and proceed accordingly.
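
      A rough sketch of that idea (hypothetical file name; it uses Mata's fopen()/fread() to look at only the first three bytes, rather than pulling the whole file in with -fileread()-):

          local f "survey.csv"                    // hypothetical file name
          mata:
          fh = fopen(st_local("f"), "r")
          head = fread(fh, 3)                     // first three bytes of the file
          fclose(fh)
          // EF BB BF is the UTF-8 byte-order mark
          st_local("isutf8", (head == char((239, 187, 191))) ? "1" : "0")
          end
          if `isutf8' {
              import delimited using "`f'", encoding("utf-8") clear
          }
          else {
              import delimited using "`f'", clear // default (latin1) import
          }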



      • #4
        Dear Mike,

        Not all UTF-8 files will contain EF BB BF as the first bytes. These bytes are known as the byte-order mark (BOM) and are entirely optional: a UTF-8 file is perfectly legal with or without a BOM. Windows' Notepad.exe has written the BOM since around Windows Vista, but the wind, it seems, has changed direction, and the default is now not to write the BOM at all. See this for details:
        https://www.bleepingcomputer.com/new...oding-support/

        A while ago I myself suffered from cross-software differences in whether the BOM is written or expected.
        That resulted in http://www.radyakin.org/progs/bomtools/

        That helps if you want to separate Unicode files from ANSI or ASCII files. If you want to distinguish between the code pages within ANSI, there is no way, unless you know something about the content and can mount a dictionary- or rule-based attack on it. The problem is not new to CSV files: traditional Stata datasets may also have contained Cyrillic, Arabic, or other characters in the corresponding code page, with no way to tell unless you try or someone tells you. Hence trickery like that on page 17 here.
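
        For separating UTF-8 from the single-byte code pages without relying on a BOM, here is a rough sketch (hypothetical file name; it uses Stata's fileread() and ustrinvalidcnt()) that checks whether the raw bytes contain any sequences that are invalid in UTF-8:

            * hypothetical file name; note this clears whatever data are in memory
            local f "survey.csv"
            clear
            set obs 1
            generate strL raw = fileread("`f'")
            * ustrinvalidcnt() counts byte sequences that are not valid UTF-8;
            * zero means the content is at least consistent with UTF-8
            * (pure ASCII also passes, and either encoding reads that correctly)
            local bad = ustrinvalidcnt(raw[1])
            if `bad' == 0 {
                import delimited using "`f'", encoding("utf-8") clear
            }
            else {
                * invalid sequences present: presumably an extended-ASCII code page,
                * though this cannot say which one
                import delimited using "`f'", clear
            }

        It is only a heuristic, of course: a latin1 file can occasionally contain byte pairs that happen to form valid UTF-8.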

        Hope this helps!

        Best regards, Sergiy Radyakin



        • #5
          Thanks, Sergiy, for the thorough answer.

