Hi all,
I have some extremely large text files (up to 4+ GB) that I want to work with, and I have been using the user-written chunky command. It works perfectly for all files except those larger than roughly 2.1 GB. I have "analyzed" the files with both chunky (. chunky using *.txt, analyze) and hexdump (. hexdump *.txt, analyze results) and found nothing out of the ordinary.
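For reference, those checks looked roughly like this on each of the large files (the file name below is just a placeholder for one of the real files):
. chunky using "B:/Folder A/bigfile.txt", analyze
. hexdump "B:/Folder A/bigfile.txt", analyze results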
So, here is an example of an attempt to chunk a 3.7 GB file.
***start report***
Chunk fl0001.txt saved. Now at position 100,004,695
Chunk fl0002.txt saved. Now at position 200,006,031
Chunk fl0003.txt saved. Now at position 300,006,909
Chunk fl0004.txt saved. Now at position 400,007,952
Chunk fl0005.txt saved. Now at position 500,008,476
Chunk fl0006.txt saved. Now at position 600,008,731
Chunk fl0007.txt saved. Now at position 700,010,016
Chunk fl0008.txt saved. Now at position 800,010,463
Chunk fl0009.txt saved. Now at position 900,010,973
Chunk fl0010.txt saved. Now at position 1,000,012,845
Chunk fl0011.txt saved. Now at position 1,100,014,111
Chunk fl0012.txt saved. Now at position 1,200,015,470
Chunk fl0013.txt saved. Now at position 1,300,016,746
Chunk fl0014.txt saved. Now at position 1,400,018,081
Chunk fl0015.txt saved. Now at position 1,500,019,501
Chunk fl0016.txt saved. Now at position 1,600,020,438
Chunk fl0017.txt saved. Now at position 1,700,022,347
Chunk fl0018.txt saved. Now at position 1,800,023,511
Chunk fl0019.txt saved. Now at position 1,900,024,496
Chunk fl0020.txt saved. Now at position 2,000,026,303
Chunk fl0021.txt saved. Now at position 2,100,027,901
ftell(): 2094938649 Stata returned error
chunkfile(): - function returned error
<istmt>: - function returned error [1]
r(2094938649);
end of do-file
r(2094938649);
***end report***
I am running Stata MP (Dual Core) 12.1 (ahem, soon upgrading to 13) on Windows 7, 64-bit.
I suspect this has something to do with memory (which is exactly why I was using chunky in the first place), but I cannot discern the problem or a workaround. I could not find any documentation of a Stata error ftell(): 2094938649. FYI, my memory settings are as follows (a small Mata test of ftell(), which might help isolate where things break, is sketched just after them):
. q memory
-------------------------------------------------------------------------------
Memory settings
    set maxvar          32000       2048-32767; max. vars allowed
    set matsize         10000       10-11000; max. # vars in models
    set niceness            5       0-10
    set min_memory          0       0-3200g
    set max_memory          .       64m-3200g or .
    set segmentsize       64m       1m-32g
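In case it helps narrow things down, here is a minimal Mata sketch that simply seeks through one of the big files in 100 MB jumps and prints ftell() after each jump, to see whether plain ftell() also misbehaves once the position passes the 2 GB mark. The file name is only a placeholder for one of the actual files.
****start test sketch****
mata:
// open one of the large files read-only (placeholder file name)
fh = fopen("B:/Folder A/bigfile.txt", "r")
// jump forward 100,000,000 bytes at a time; 25 jumps crosses 2,147,483,648 bytes (2^31)
for (i = 1; i <= 25; i++) {
    fseek(fh, 100000000, 0)   // seek relative to the current position
    printf("after jump %3.0f, ftell() reports %16.0f\n", i, ftell(fh))
}
fclose(fh)
end
****end test sketch****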
Finally, as an FYI, here is an approximate version of the code:
****start ado****
cd "B:/Folder A"
local files : dir . files "*.txt"
foreach f of local files {
    cd "B:/FolderB"                                  // so that the chunk files are saved here
    local newf = subinstr("`f'", ".txt", "", .)      // extract the prefix for the chunk stub
    chunky using "B:/Folder A/`f'", chunksize(100 mb) header(include) stub(`newf') replace
    cd "B:/Folder A"                                 // back to the folder with the files to be chunked
}
****end ado****
Any ideas on how to solve this?
(FYI, I have also emailed chunky's author, David Elliott, directly.)
Thanks, as always, in advance,
Ben