  • Error While Using User-Written Chunky

    Hi all,

    I have some extremely large txt files (up to 4+ GB) that I want to work with, and I have tried to use the user-written -chunky- command. It works perfectly for all files except those larger than ~2.1 GB. I have "analyzed" the files using both chunky (. chunky using *.txt, analyze) and hexdump (. hexdump *.txt, analyze results) and have found nothing out of the ordinary.

    So, here is an example of an attempt to chunk a 3.7 GB file.

    *** start report ***

    Chunk fl0001.txt saved. Now at position 100,004,695
    Chunk fl0002.txt saved. Now at position 200,006,031
    Chunk fl0003.txt saved. Now at position 300,006,909
    Chunk fl0004.txt saved. Now at position 400,007,952
    Chunk fl0005.txt saved. Now at position 500,008,476
    Chunk fl0006.txt saved. Now at position 600,008,731
    Chunk fl0007.txt saved. Now at position 700,010,016
    Chunk fl0008.txt saved. Now at position 800,010,463
    Chunk fl0009.txt saved. Now at position 900,010,973
    Chunk fl0010.txt saved. Now at position 1,000,012,845
    Chunk fl0011.txt saved. Now at position 1,100,014,111
    Chunk fl0012.txt saved. Now at position 1,200,015,470
    Chunk fl0013.txt saved. Now at position 1,300,016,746
    Chunk fl0014.txt saved. Now at position 1,400,018,081
    Chunk fl0015.txt saved. Now at position 1,500,019,501
    Chunk fl0016.txt saved. Now at position 1,600,020,438
    Chunk fl0017.txt saved. Now at position 1,700,022,347
    Chunk fl0018.txt saved. Now at position 1,800,023,511
    Chunk fl0019.txt saved. Now at position 1,900,024,496
    Chunk fl0020.txt saved. Now at position 2,000,026,303
    Chunk fl0021.txt saved. Now at position 2,100,027,901
    ftell(): 2094938649 Stata returned error
    chunkfile(): - function returned error
    <istmt>: - function returned error [1]
    r(2094938649);

    end of do-file

    r(2094938649);

    *** end report ***

    I am running Stata MP (Dual Core) 12.1 (ahem, soon upgrading to 13) on Windows 7, 64-bit.

    I suspect this has something to do with memory (which is the exact reason I was using chunky), but I cannot discern the problem or a workaround. I could not find any documentation of the Stata error ftell(): 2094938649. FYI, my memory settings are as follows:


    . q memory
    --------------------------------------------------------------------
    Memory settings
        set maxvar          32000     2048-32767; max. vars allowed
        set matsize         10000     10-11000; max. # vars in models
        set niceness            5     0-10
        set min_memory          0     0-3200g
        set max_memory          .     64m-3200g or .
        set segmentsize       64m     1m-32g

    Finally, as an FYI, here is an approximate version of the code:

    **** start ado ****

    cd "B:/Folder A"
    local files : dir . files "*.txt"

    foreach f of local files {
        cd "B:/FolderB"                                // so that the chunks are saved here
        local newf = subinstr("`f'", ".txt", "", .)    // strip .txt to get the stub
        chunky using "B:/Folder A/`f'", chunksize(100 mb) header(include) stub(`newf') replace
        cd "B:/Folder A"                               // back to the folder with the files to be chunked
    }

    **** end ado ****

    Any ideas on how to solve this?

    (FYI, I have also emailed the -chunky- author, David Elliott, directly.)

    Thanks, as always, in advance,

    Ben

  • #2
    Ben,

    According to the Mata documentation:

    fseek(fh, offset, whence) and _fseek(fh, offset, whence) abort with error if offset is outside the range +/-2,147,483,647 on 32-bit computers; if offset is outside the range +/-9,007,199,254,740,991 on 64-bit computers; or, on all computers, if whence is not -1, 0, or 1.
    It doesn't mention ftell(), but I suspect that the limitation is the same in both cases (ftell() reports the current position in the file; fseek() moves to a given position in the file).
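
    To make that concrete, here is a minimal Mata sketch (the file name is a placeholder); on an affected build it would be the ftell() call that aborts once the position passes 2,147,483,647:

    mata:
    // probe the end-of-file position of a large file
    fh = fopen("bigfile.txt", "r")
    fseek(fh, 0, 1)                   // offset 0 from the end of file (whence==1)
    pos = ftell(fh)                   // aborts here if the position exceeds 2^31-1
    printf("length: %f bytes\n", pos)
    fclose(fh)
    end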

    Regards,
    Joe



    • #3
      Thanks Joe.

      That definitely seems to be at the root of it. I am a very novice Mata user, so I am not sure I fully understand, but there is still the question of why my 64-bit OS is not being recognized by this routine. When I click on "About" inside Stata, it shows the 64-bit version. Maybe the routine was not written for 64-bit computers and therefore only uses the 32-bit limits. It looks like the current version of -chunky- was last updated in 2010. Maybe David Elliott will have more details.

      Thanks,

      Ben



      • #4
        I got the same thing on my 64-bit Windows OS, so I don't think it is you. Looking at the chunky code, I'm not sure how the author could have done anything differently to take advantage of a 64-bit OS. Perhaps the fseek() function takes advantage of it, but ftell() doesn't. Note that the long data type in Stata is 4 bytes and has a limit of about 2.1 billion, and that this is the same whether you are using a 32-bit or a 64-bit OS. Likewise, perhaps fseek() adjusts to the OS, while ftell() uses a fixed-length data type.
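
        You can check those limits directly from Stata's c() values (a quick sketch; the values in the comments are the documented limits):

        display %15.0fc c(maxlong)    // 2,147,483,620: largest nonmissing long
        display %15.0fc c(minlong)    // -2,147,483,647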



        • #5
          Ben,

          Thank you for your inquiry regarding -chunky-, an ado I wrote a number of years ago to provide an in-Stata method of splitting very large data dumps into sizes manageable within 32-bit Stata limits. Over the past couple of years, various people have had errors with -chunky- that I have been unable to unravel. It will work for most files and then choke on a particular one. I have had people chunk files in excess of 200 million records and 100 GB with -chunky-, only to have it stop in the middle of a smaller file. I can't replicate the error, since there is no way for me to debug without the particular source file.

          One of those users found a freeware utility called gsplit [http://www.gdgsoft.com/gsplit/] that worked well and does the same thing -chunky- does (and far faster). Specifically, it works in command-line/batch mode, so you can use -!gsplit args- or -winexec gsplit args-. It will also replicate a header from a csv file across all the chunks so that you can -insheet- them individually.
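
          For example, launching it from inside Stata looks something like this (the install path is an assumption on my part, and I have left out the batch switches; they are listed in gsplit's help under "Batch, Command Line Options, Predefined values"):

          * launch gsplit from Stata; the install path is an assumption
          winexec "C:\Program Files\GSplit\gsplit.exe"
          * (the batch-mode switches would follow the program name)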

          I would recommend you look into this as a replacement for -chunky- in your situation. It is a bit unwieldy to use the first time. The option you need to look for is under the "Pieces" menu, in the section on "Blocked Piece Properties". There is a drop-down option box with "I want to split after the nth occurrence of a specified pattern" - a rather obscure way of saying you want to split at 0x0D 0x0A, the Windows end-of-line characters. This then allows you to split at the end of n lines. I've attached a screenshot with an annotated sequence of choices.

          You can run the program in batch mode - look in the help file for "Batch, Command Line Options, Predefined values." Let me know how it works for you.

          Regards,
          David Elliott



          • #6
            David,

            Thank you. I have now successfully used the gsplit software (after some trial and error). It worked perfectly and had all the features I needed to do the job.

            I now have manageably sized txt files that I can zip through, pull out the cases I want, keep only the variables I want, and distill down to a usable Stata dataset.
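
            In case it helps anyone who finds this thread later, that processing loop looks roughly like this (a sketch only - the selection rule and variable names are hypothetical):

            cd "B:/FolderB"                      // folder holding the chunks
            local chunks : dir . files "*.txt"
            tempfile building
            local first = 1
            foreach c of local chunks {
                insheet using "`c'", clear
                keep if keepcase == 1            // hypothetical selection rule
                keep id keepcase outcome         // hypothetical variable list
                if `first' {
                    save `building'
                    local first = 0
                }
                else {
                    append using `building'
                    save `building', replace
                }
            }
            use `building', clear
            save "B:/Folder A/distilled.dta", replace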

            Thanks for the advice,

            Ben

