Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Import delimited - ignoring linebreaks

    Hello,
    I have (random) linebreaks that I cannot remove with Notepad++ before importing. Three observations, seven variables, delimiter is |,|:

    |id1|,|name1, first1 raw|,|cat1|,|2017|,||,||,|n|
    |id2|,|name2, first1 raw|,|cat2|,|2017|,|text1
    text2
    text3
    text4|,||,|n|
    |id3|,|name3, first3 raw|,|cat3|,|2017|,||,||,|n|

    I am getting too many observations with

    import delimited "file.txt", delimiter("|,|", asstring) bindquote(strict) maxquotedrows(unlimited) clear

    Is there a way to ignore the linebreak and read lines 2-5 as one line/observation?

    Thank you!



  • #2
    According to its helpfile, "import delimited reads into memory a text file in which there is one observation per line and the values are separated by commas, tabs, or some other delimiter." So you will need an alternative approach. Here is one such approach, which relies on the end of each row being "n|" followed by a newline:

    Code:
    clear all
    set obs 1
    gen x = subinstr(ustrregexra(ustrregexra(fileread("test.txt"),"n\|((\r\n|\r|\n|$)(\|)?)"," end of row "),"(\r\n|\r|\n)","")," end of row ","n`=char(10)'",.)
    split x, p("`=char(10)'") gen(var)
    drop x
    gen i = _n
    reshape long var, i(i) j(j)
    split var, p("|,|")
    drop i j var
    replace var1 = subinstr(var1,"|","",1)
    list, noobs
    
      +-----------------------------------------------------------------------------+
      | var1                var2   var3   var4                   var5   var6   var7 |
      |-----------------------------------------------------------------------------|
      |  id1   name1, first1 raw   cat1   2017                                    n |
      |  id2   name2, first1 raw   cat2   2017   text1text2text3text4             n |
      |  id3   name3, first3 raw   cat3   2017                                    n |
      +-----------------------------------------------------------------------------+
    Last edited by Ali Atia; 21 Jul 2021, 21:53.

    Comment


    • #3
      My apologies I just noticed that in my example all observations end with n (no). It is the most common case in the data but there are observations ending with y (yes). Would you run above with replace for the y observations or loop?

      Thank you!

      Comment


      • #4
        Updated code accounting for both y and n endings:

        Code:
        clear all
        set obs 1
        gen x = subinstr(ustrregexra(ustrregexra(fileread("test.txt"),"(n|y)\|((\r\n|\r|\n|$)(\|)?)","$1 end of row "),"(\r\n|\r|\n)","")," end of row ","`=char(10)'",.)
        split x, p("`=char(10)'") gen(var)
        drop x
        gen i = _n
        reshape long var, i(i) j(j)
        split var, p("|,|")
        drop i j var
        replace var1 = subinstr(var1,"|","",1)
        list, noobs
        NB: this code assumes the characters " end of row " do not appear anywhere in the original dataset.

        Comment


        • #5
          Unfortunately, I am getting an error message of "fileread() error 601". The original data set has 8 variables but I don't see how this would make a difference in the code. Does char(10) restrict the length?

          Thank you!

          Comment


          • #6
            You just need to change test.txt (emboldened below) to the name of your textfile:

            Code:
            clear all
            set obs 1
            gen x = subinstr(ustrregexra(ustrregexra(fileread("test.txt"),"(n|y)\|((\r\n|\r|\n|$)(\|)?)","$1 end of row "),"(\r\n|\r|\n)","")," end of row ","`=char(10)'",.)
            split x, p("`=char(10)'") gen(var)
            drop x
            gen i = _n
            reshape long var, i(i) j(j)
            split var, p("|,|")
            drop i j var
            replace var1 = subinstr(var1,"|","",1)
            list, noobs

            Comment


            • #7
              The working directory was off - still running...

              Comment


              • #8
                According to its helpfile, "import delimited reads into memory a text file in which there is one observation per line and the values are separated by commas, tabs, or some other delimiter." So you will need an alternative approach.
                Or may be not. According to the same help file:
                bindquotes(loose | strict | nobind) specifies how import delimited handles double quotes in data. Specifying loose (the default) tells import delimited that it must have a matching open and closed double quote on the same line of data. strict tells import delimited that once it finds one double quote on a line of data, it should keep searching through the data for 4 import delimited — Import delimited text data the matching double quote even if that double quote is on another line. Specifying nobind tells import delimited to ignore double quotes for binding.
                The command import delimited in Stata does allow to import multiline data, provided that the values are enclosed in double-quotes.

                The confusion probably stems from the incorrect description in the problem setting. In this file the delimiter is not the |,| as written in the first post, but simply a comma. The pipe characters | in fact function as double quotes in this particular file layout. So |id1| is in fact "id1" in a classical layout. Import delimited does allow to specify the delimiter, but sadly (afaik) does not allow to override the double-quotes character to be something else then double quotes specifically (perhaps Stata developers can include an option for that?). Then we can use a filefilter command for such replacement:

                Code:
                clear all
                tempfile tmp
                filefilter "C:\temp\ex.txt" "`tmp'", from("|") to(`"""')
                import delimited `"`tmp'"', delim(",") bindquotes(strict)
                The reason for using the pipe character is unclear, and Tom Groll didn't elaborate which particular software produced the file.
                If that software is not just weird, but smart - it may have decided to do so, to avoid conflicts with the double-quotes used in values. In which case simple replacement as above will not work (or work incorrectly).
                In that case I would temporarily replace the quotes in the original file into something else (like @, ! or any other symbol that is not going to occur) and then replace back when the data is already imported.

                Final commend is that the text that was originally having separate words on different lines will look like have lost the delimiters:
                Click image for larger version

Name:	text1.png
Views:	1
Size:	37.4 KB
ID:	1620138


                but if you do a bit of highlighting, you will notice the delimiter is still there (and it is ASCII char 13):
                Click image for larger version

Name:	text2.png
Views:	1
Size:	40.7 KB
ID:	1620139


                This way of displaying content is rather weird on the Stata's side. But in practical terms, if you need to process the words as separate subsequently, then you may want to replace the 13 character (CR) with e.g. 32 character (WHITESPACE).

                Hope this helps,
                Sergiy Radyakin

                Comment


                • #9
                  Originally posted by Sergiy Radyakin View Post
                  The confusion probably stems from the incorrect description in the problem setting. In this file the delimiter is not the |,| as written in the first post, but simply a comma. The pipe characters | in fact function as double quotes in this particular file layout. So |id1| is in fact "id1" in a classical layout. Import delimited does allow to specify the delimiter, but sadly (afaik) does not allow to override the double-quotes character to be something else then double quotes specifically (perhaps Stata developers can include an option for that?). Then we can use a filefilter command for such replacement:

                  Code:
                  clear all
                  tempfile tmp
                  filefilter "C:\temp\ex.txt" "`tmp'", from("|") to(`"""')
                  import delimited `"`tmp'"', delim(",") bindquotes(strict)
                  Thank you! I had to think more about the delimiter ("," are all over). I replaced the delimiter with an unused character (~) preserving the other info and respecting the piping, then ran your code. It's all working now!

                  Comment

                  Working...
                  X