Import delimited - ignoring linebreaks

Tom Groll

Join Date: Jul 2021

Posts: 11
#1

Import delimited - ignoring linebreaks

21 Jul 2021, 16:15

Hello,
I have (random) linebreaks that I cannot remove with Notepad++ before importing. Three observations, seven variables, delimiter is |,|:

|id1|,|name1, first1 raw|,|cat1|,|2017|,||,||,|n|
|id2|,|name2, first1 raw|,|cat2|,|2017|,|text1
text2
text3
text4|,||,|n|
|id3|,|name3, first3 raw|,|cat3|,|2017|,||,||,|n|

I am getting too many observations with

import delimited "file.txt", delimiter("|,|", asstring) bindquote(strict) maxquotedrows(unlimited) clear

Is there a way to ignore the linebreak and read lines 2-5 as one line/observation?

Thank you!
Tags: import

Ali Atia

Join Date: May 2020
Posts: 737

21 Jul 2021, 21:46

According to its helpfile, "import delimited reads into memory a text file in which there is one observation per line and the values are separated by commas, tabs, or some other delimiter." So you will need an alternative approach. Here is one such approach, which relies on the end of each row being "n|" followed by a newline:

Code:

clear all
set obs 1
gen x = subinstr(ustrregexra(ustrregexra(fileread("test.txt"),"n\|((\r\n|\r|\n|$)(\|)?)"," end of row "),"(\r\n|\r|\n)","")," end of row ","n`=char(10)'",.)
split x, p("`=char(10)'") gen(var)
drop x
gen i = _n
reshape long var, i(i) j(j)
split var, p("|,|")
drop i j var
replace var1 = subinstr(var1,"|","",1)
list, noobs

  +-----------------------------------------------------------------------------+
  | var1                var2   var3   var4                   var5   var6   var7 |
  |-----------------------------------------------------------------------------|
  |  id1   name1, first1 raw   cat1   2017                                    n |
  |  id2   name2, first1 raw   cat2   2017   text1text2text3text4             n |
  |  id3   name3, first3 raw   cat3   2017                                    n |
  +-----------------------------------------------------------------------------+

Last edited by Ali Atia; 21 Jul 2021, 21:53.

Comment

Tom Groll

Join Date: Jul 2021

Posts: 11
#3

21 Jul 2021, 21:52

My apologies I just noticed that in my example all observations end with n (no). It is the most common case in the data but there are observations ending with y (yes). Would you run above with replace for the y observations or loop?

Thank you!
Comment

Ali Atia

Join Date: May 2020
Posts: 737

21 Jul 2021, 21:59

Updated code accounting for both y and n endings:

Code:

clear all
set obs 1
gen x = subinstr(ustrregexra(ustrregexra(fileread("test.txt"),"(n|y)\|((\r\n|\r|\n|$)(\|)?)","$1 end of row "),"(\r\n|\r|\n)","")," end of row ","`=char(10)'",.)
split x, p("`=char(10)'") gen(var)
drop x
gen i = _n
reshape long var, i(i) j(j)
split var, p("|,|")
drop i j var
replace var1 = subinstr(var1,"|","",1)
list, noobs

NB: this code assumes the characters " end of row " do not appear anywhere in the original dataset.

Comment

Tom Groll

Join Date: Jul 2021

Posts: 11
#5

21 Jul 2021, 22:11

Unfortunately, I am getting an error message of "fileread() error 601". The original data set has 8 variables but I don't see how this would make a difference in the code. Does char(10) restrict the length?

Thank you!
Comment

Ali Atia

Join Date: May 2020
Posts: 737

21 Jul 2021, 22:16

You just need to change test.txt (emboldened below) to the name of your textfile:

Code:

clear all
set obs 1
gen x = subinstr(ustrregexra(ustrregexra(fileread("test.txt"),"(n|y)\|((\r\n|\r|\n|$)(\|)?)","$1 end of row "),"(\r\n|\r|\n)","")," end of row ","`=char(10)'",.)
split x, p("`=char(10)'") gen(var)
drop x
gen i = _n
reshape long var, i(i) j(j)
split var, p("|,|")
drop i j var
replace var1 = subinstr(var1,"|","",1)
list, noobs

Comment

Tom Groll

Join Date: Jul 2021

Posts: 11
#7

21 Jul 2021, 22:41

The working directory was off - still running...
Comment
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#8

22 Jul 2021, 03:13

According to its helpfile, "import delimited reads into memory a text file in which there is one observation per line and the values are separated by commas, tabs, or some other delimiter." So you will need an alternative approach.

Or may be not. According to the same help file:

bindquotes(loose | strict | nobind) specifies how import delimited handles double quotes in data. Specifying loose (the default) tells import delimited that it must have a matching open and closed double quote on the same line of data. strict tells import delimited that once it finds one double quote on a line of data, it should keep searching through the data for 4 import delimited — Import delimited text data the matching double quote even if that double quote is on another line. Specifying nobind tells import delimited to ignore double quotes for binding.

The command import delimited in Stata does allow to import multiline data, provided that the values are enclosed in double-quotes.

The confusion probably stems from the incorrect description in the problem setting. In this file the delimiter is not the |,| as written in the first post, but simply a comma. The pipe characters | in fact function as double quotes in this particular file layout. So |id1| is in fact "id1" in a classical layout. Import delimited does allow to specify the delimiter, but sadly (afaik) does not allow to override the double-quotes character to be something else then double quotes specifically (perhaps Stata developers can include an option for that?). Then we can use a filefilter command for such replacement:

Code:

clear all tempfile tmp filefilter "C:\temp\ex.txt" "`tmp'", from("|") to(`"""') import delimited `"`tmp'"', delim(",") bindquotes(strict)

The reason for using the pipe character is unclear, and Tom Groll didn't elaborate which particular software produced the file.
If that software is not just weird, but smart - it may have decided to do so, to avoid conflicts with the double-quotes used in values. In which case simple replacement as above will not work (or work incorrectly).
In that case I would temporarily replace the quotes in the original file into something else (like @, ! or any other symbol that is not going to occur) and then replace back when the data is already imported.

Final commend is that the text that was originally having separate words on different lines will look like have lost the delimiters:

but if you do a bit of highlighting, you will notice the delimiter is still there (and it is ASCII char 13):

This way of displaying content is rather weird on the Stata's side. But in practical terms, if you need to process the words as separate subsequently, then you may want to replace the 13 character (CR) with e.g. 32 character (WHITESPACE).

Hope this helps,
Sergiy Radyakin
2 likes
Comment
Tom Groll

Join Date: Jul 2021

Posts: 11
#9

22 Jul 2021, 12:23

Originally posted by Sergiy Radyakin View Post

The confusion probably stems from the incorrect description in the problem setting. In this file the delimiter is not the |,| as written in the first post, but simply a comma. The pipe characters | in fact function as double quotes in this particular file layout. So |id1| is in fact "id1" in a classical layout. Import delimited does allow to specify the delimiter, but sadly (afaik) does not allow to override the double-quotes character to be something else then double quotes specifically (perhaps Stata developers can include an option for that?). Then we can use a filefilter command for such replacement:

Code:

clear all tempfile tmp filefilter "C:\temp\ex.txt" "`tmp'", from("|") to(`"""') import delimited `"`tmp'"', delim(",") bindquotes(strict)

Thank you! I had to think more about the delimiter ("," are all over). I replaced the delimiter with an unused character (~) preserving the other info and respecting the piping, then ran your code. It's all working now!
Comment

Announcement