  • -fileread/write()- can make a faster/better -filefilter-

    I'm just writing to note that -fileread()- and -filewrite()- can function nicely as part of a *faster* and more sophisticated version of -filefilter-.

    Illustration: While recently working with a 325M text file, I wanted to filter it in various ways, including the simple task of changing all instances of multiple blanks to one. I started off by using multiple calls to -filefilter- (change 10 blanks to 1, 9 blanks to 1, ..., 2 blanks to 1). I then thought to use -fileread()- and -filewrite()- with ordinary string functions:
    Code:
    gen strL s = fileread("infile.txt")   // read the whole file into a strL
    replace s = itrim(s)                  // collapse multiple blanks to one
    gen b = filewrite("outfile.txt", s)   // write the filtered text back out
    This approach was not only convenient but also much faster, at least on a Windows machine. For example, on my 325M file, just the partial task of substituting one blank for two blanks with -filefilter- took something like 30 sec. Doing the whole task with fileread/write and itrim() as above took less than 4 sec. In principle, one could use this approach to create an enhanced version of -filefilter- that accepted regular expressions, worked with binary as well as text files, etc. -fileread()- is limited by the 2 GB maximum length of a strL, but that's not typically an issue.
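
    As a rough sketch of what such an enhanced, regex-capable filter might look like (the file names are placeholders, and ustrregexra() requires Stata 14 or later, so this goes a step beyond the Stata 13 functions discussed here):
    Code:
    clear
    set obs 1
    gen strL s = fileread("infile.txt")      // read the whole file
    assert filereaderror(s) == 0             // stop if the read failed
    replace s = ustrregexra(s, " +", " ")    // regex: collapse blank runs
    gen b = filewrite("outfile.txt", s, 1)   // third argument 1 = overwrite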

    This is one of several instances in which -fileread/write- have impressed me with their speed and convenience.

    Regards, Mike

  • #2
    For Stata 13 only, of course...


    • #3
      Mike,

      this is clearly because -fileread()- reads the whole file in one operation, while -filefilter- makes many calls to the OS to process a file. This is apparent from the documentation:

      Because of the buffering design of filefilter, arbitrarily large files can be converted quickly
      (from here)

      although, as you show, "quickly" is not as quick as it could be. But I would guess that handling arbitrarily large files is the key feature of -filefilter-. The size of its buffer is not known for sure and may vary between versions. As far as I recall, there was once a bug related to a CR/LF pair being split across the boundary between 512-byte blocks (a long time ago, in v9 or v10). Though very indirect, this could be evidence that the internal read buffer is as small as that, in which case a 325M file is a huge job.
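
      For intuition, here is a minimal Mata sketch of block-wise reading with a small buffer (the path is a placeholder, and 512 bytes is only the size that old bug hints at):
      Code:
      mata
         fh = fopen("c:\data\file.txt", "r")
         total = 0
         while ((chunk = fread(fh, 512)) != J(0, 0, "")) {
            total = total + strlen(chunk)   // process one small block per pass
         }
         total                              // bytes seen, block by block
         fclose(fh)
      end
      Each pass through the loop is a separate read request, which is where the overhead accumulates on a 325M file.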

      Buffering output can be a tremendous performance improvement. I saw a similar speedup for usespss (an early unreleased version wrote directly to the file). It was later changed to use a fixed 1MB buffer (deemed always affordable on any modern computer), and you can see evidence of this in the usespss output:
      ...
      Optimized data record size is:3
      Allocated 1048576 bytes for the write buffer
      Write buffer capacity is 349525 records

      ...
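
      A minimal Mata sketch of that output-buffering idea (the output path is hypothetical; the 3-byte records and 1MB buffer mirror the numbers above):
      Code:
      mata
         fh  = fopen("c:\data\out.dat", "w")
         buf = ""
         for (i = 1; i <= 1000000; i++) {                // a million records
            buf = buf + sprintf("%03.0f", mod(i, 1000))  // one 3-byte record
            if (strlen(buf) >= 1048576) {                // buffer full: flush
               fwrite(fh, buf)                           // one OS call per MB
               buf = ""
            }
         }
         if (strlen(buf)) fwrite(fh, buf)                // flush the remainder
         fclose(fh)
      end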

      Best, Sergiy Radyakin


      • #4
        Hi Sergiy,

        I can also imagine this behavior differing depending on the hardware, which might transparently buffer I/O in ways that make -filefilter- faster. What surprised me here was that my DIY version of a pretty low-level process was faster than the built-in version. It would be nice if -fileread()- could be instructed to read a given number of bytes <= 2e9, rather than exiting with an error, which would allow -fileread- to be used on arbitrarily large files. (The user could do some kind of "while not eof, get the next 2e9 bytes" loop.) From my perspective, reading the first 2e9 bytes of a file and returning an error code as necessary would be more Stata-ish than the current behavior of reading nothing from a too-large file.


        Regards, Mike


        • #5
          Mike, I don't think reading is a problem:

          Code:
          mata
             fh = fopen("c:\data\file.txt", "r")   // open the file for reading
             s = fread(fh, 78000000)               // read up to 78,000,000 bytes
             strlen(s)                             // bytes actually read
             fclose(fh)
          end
          The problem is more in:
          get the next 2e9 bytes
          The fseek() in 32-bit Stata is limited to 2GB and will not seek past that, so you can't get the next 2e9 bytes in 32-bit Stata. Also, while Stata's dataset memory manager was redesigned in v12 to allow non-contiguous spans of memory (afaik), nothing like that is said about strings. So the longest string you can read on a 32-bit machine is likely well below 2GB, and it will still compete with the main dataset for memory or swapping.

          On 64-bit Stata you can process files of any size, and this is what I do with my data-conversion routines on an everyday basis. So, yes, you can create a faster implementation of -filefilter- than the built-in one if you put in enough effort.
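
          As a hedged sketch of such an implementation (the paths are placeholders; chunk boundaries are not handled, so a blank run split across two chunks would survive, the same boundary issue as the old CR/LF bug mentioned in #3):
          Code:
          mata
             fin  = fopen("c:\data\in.txt", "r")
             fout = fopen("c:\data\out.txt", "w")
             while ((chunk = fread(fin, 100000000)) != J(0, 0, "")) {
                fwrite(fout, stritrim(chunk))   // filter each ~100MB chunk
             }
             fclose(fin)
             fclose(fout)
          end
          Because no single string ever approaches the strL or 32-bit limits, this pattern works for arbitrarily large files.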

          Another part of the efficiency gain may come from -filefilter- supporting some options and parsing for special characters, etc., which you probably don't do. But my estimate is that this contribution is small.

          Best, Sergiy Radyakin

          PS: -filefilter- supports ebcdic2ascii conversion. If anyone has an original real-life file in EBCDIC and can share it with me (ideally with its history attached), please send me a private message (using this forum software: right-click my name, switch to "Activities", click "Private message").


          • #6
            Good points. I had barely used the -fread()- function in Mata. I experimented and found that it can read (for example) a 1G file and -st_sstore()- it into a strL in about the same amount of time that -fileread()- takes (about 1.5 sec for the former, about 0.7 sec for the latter), so the flexibility of -fread()- to read a designated number of bytes would generally be worth the small absolute time cost.
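
            For reference, a minimal sketch of the -fread()- route used in that comparison (the path and byte count are placeholders):
            Code:
            mata
               fh = fopen("c:\data\file.txt", "r")
               s = fread(fh, 1000000000)          // read up to 1e9 bytes
               fclose(fh)
               if (st_nobs() == 0) st_addobs(1)   // need an observation to store into
               (void) st_addvar("strL", "s")      // create a strL variable
               st_sstore(1, "s", s)               // put the file contents in it
            end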

            Regards, Mike
