Infix with unicode characters

Mark Greenan

Join Date: Mar 2024

Posts: 2
#1


12 Mar 2024, 14:43

Hello,

I'm trying to use the infix command to import a fixed-width text file that contains Japanese.

The original dataset was in SHIFT_JIS encoding (which is "ibm-943_P15A-2003" based on the page from "help encodings"), so I converted the file to UTF-8 using unicode translate:

unicode encoding set ibm-943_P15A-2003
unicode translate "h3jcho19.dat"

I then confirmed with my text editor (VS Code) that the Japanese characters appear correctly whenever I view the file with a UTF-8 encoding.

Then I used the infix command to import the file. However, when Stata imports the file, many of the characters are incorrect and they appear as squares with question marks (�). The infix help file says "If string data are encoded as ASCII or UTF-8, they will be imported correctly." So why am I unable to get the correct Japanese characters?

Any suggestions?

Thanks in advance for the help. As an FYI, I have a Mac and I am using Stata 18.

(As an aside, I also tried converting the original file to UTF-8 using Terminal's iconv command. Once again, I confirmed that my text editor could read the new file, but Stata could not import it correctly either.)
Joseph Coveney

Join Date: Apr 2014

Posts: 4352
#2

13 Mar 2024, 00:37

Originally posted by Mark Greenan

. . . why am I unable to get the correct Japanese characters?

Difficult to say given what you've shown, but one possibility to consider is that your specifications for the string variables' beginning and end bytes are off by one or so.

You might want to post a reproducible example using a .dct file with a line or two of your dataset enclosed at its end.
Mark Greenan

Join Date: Mar 2024

Posts: 2
#3

13 Mar 2024, 11:10

You were right! Thank you very much.

The problem was that the column range in the infix specifications was different from the column range indicated by my text editor for the beginning and end of a variable.

Presumably this discrepancy is because the Japanese unicode characters take up more than one byte and the infix specification cares about the byte-level column range rather than what the user actually sees.
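For anyone hitting the same issue, here is a minimal sketch, outside Stata, of how the byte-level column range for a field can be worked out from the character columns an editor shows. It assumes the file is UTF-8 and that the editor counts one column per character. The helper name byte_columns and the sample record are made up for illustration and are not from the actual dataset.

```python
def byte_columns(line: str, start_char: int, end_char: int,
                 encoding: str = "utf-8") -> tuple[int, int]:
    """Convert a 1-based, inclusive character column range into the
    1-based, inclusive byte column range of the same field."""
    if not (1 <= start_char <= end_char <= len(line)):
        raise ValueError("character range is outside the line")
    # Bytes taken up by everything before the field.
    prefix_bytes = len(line[:start_char - 1].encode(encoding))
    # Bytes taken up by the field itself.
    field_bytes = len(line[start_char - 1:end_char].encode(encoding))
    return prefix_bytes + 1, prefix_bytes + field_bytes


# Hypothetical record: a 3-character Japanese field followed by a 4-digit field.
record = "東京都0123"
print(byte_columns(record, 1, 3))  # (1, 9): each of these characters is 3 bytes in UTF-8
print(byte_columns(record, 4, 7))  # (10, 13): the digits are 1 byte each
```

In this made-up record the editor would show the numeric field in columns 4-7, while a byte-based specification would need 10-13.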

All good now!

Thank you!
Joseph Coveney

Join Date: Apr 2014

Posts: 4352
#4

13 Mar 2024, 18:32

Originally posted by Mark Greenan

Presumably this discrepancy is because the Japanese unicode characters take up more than one byte and the infix specification cares about the byte-level column range rather than what the user actually sees.

Yep. And you're welcome—thanks for the closure.
