  • Import of 18GB csv file

    I am a long-time user of Stata, but this is the first large dataset I have tried to manage. I have what may be a very basic question.

    Can I use Stata to open this dataset if I simply wait long enough (I've waited ~30 minutes with no success)? Is my computer's memory the limiting factor?

    .csv file size: ~18 GB
    Stata version: Stata/IC 16.1 for Mac (Intel 64-bit)
    Computer: 2019 MacBook Pro, 16 GB RAM (2.3 GHz 8-core Intel i9)

    I want to drop ~90% of the observations as soon as the file is open, but it is this first step that is defeating me.

    Many thanks



  • #2
    You can do this, but I would hesitate unless I had > 18 GB of RAM, since you may need to read the entire file into memory before you can drop any observations. You can also use -if/in- qualifiers with -import delimited-, but I am unsure offhand whether that filtering happens as the observations are read in, or only after the entire file has been read.

    If I had to do this, I would consider chunking the file by some other means (e.g., Python) into more manageable pieces for import into Stata. Since you mention wanting to drop a substantial fraction of the observations, I would write a loop to import each chunk, apply the filter, and then append the result to a growing final dataset.



    • #3
      The user-written command -chunky- (see -search chunky-) does the kind of thing Leonardo suggests. It will break a large file into pieces and, as would likely be desired for a CSV file with variable names on the first line (a header), insert that header into each piece so that each piece can be imported on its own.

      I have not used this program for some time, but my recollection is that it was fast and robust. In the current situation, the command might be used something like this:
      Code:
      chunky using "YourFile.csv", stub(work) header(include) chunksize(1 GB)
      //
      // This has broken the original file into chunks about 1GB (user choice) in size, and saved them to
      // files named work0001, work0002, ... in the current directory
      //
      // Import and append each one, dropping the extraneous observations along the way.
      clear
      save wholefile.dta, emptyok         // file that will hold the pieces
      foreach f in `s(filelist)' {        // chunky leaves behind a list of chunk file names
         clear
         di "Working on file: `f'"
         import delimited using "`f'", ...   // fill in import options to suit your file
         drop if SomeCondition               // 90% of observations to be dropped, right?
         append using "wholefile.dta"
         save "wholefile.dta", replace       // -replace- since the file already exists
      }
      I don't have any good data around to test this on, but the syntax should be close. Even if your computer *can* handle the whole 18 GB file, this might actually be just about as fast. You'd also know where the process stood at any point, and if something failed, you'd at least have part of the job done.
      Last edited by Mike Lacy; 15 Feb 2024, 11:45. Reason: P.S., afterthought: I never tried -chunk- on something this big, and it's an older program, so hearing whether it worked or not would be interesting.



      • #4
        I'm curious about this myself, and I had forgotten that -chunk- exists. If I get some time this weekend, I might experiment to see how it works.



        • #5
          To avoid any confusion: the correct name is -chunky-. Sorry for the error in my follow-up comment.



          • #6
            To add to the excellent guidance others have already given... You may need much more, or maybe even less, memory to import that entire file; it really depends on its content and is difficult to predict. For example, "1.2" needs 3 bytes of storage in the csv file but will require 4 bytes of storage (float) in Stata. Once a float is needed, even a simple value like "6" in the same column (variable) will require 4 bytes once imported into Stata. Similarly, a string variable will consume as many bytes per observation as its widest value in that column. The same rule can work in reverse for floating-point numbers: "1.234567", for example, needs 8 bytes of storage as text in the csv file but requires only 4 bytes to store as a float variable in Stata. With that said, a good starting point is to expect to need about 1.5 times as much memory as the original file size.

            -import delimited- does have some facilities that can help you directly: see the rowrange() and colrange() options. They take effect as the csv file is parsed, rather than pruning the dataset after it has been imported; in other words, they are memory efficient.
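
            For what it's worth, here is a rough, untested sketch of how rowrange() plus the width of an imported sample could be used to gauge memory needs before attempting the full import. The file name, sample size, and guessed total row count are placeholders, and I have not checked whether rowrange() counts the header row, so treat the exact range as approximate:
            Code:
            * read a modest sample of rows (placeholder file name and range)
            clear
            import delimited using "YourFile.csv", rowrange(1:100000) varnames(1)
            * c(width) is the number of bytes Stata needs per observation
            display "bytes per observation: " c(width)
            * extrapolate to a guessed total row count (500 million is a placeholder)
            display "approx. GB for the full dataset: " %6.1f c(width) * 500e6 / 1e9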



            • #7
              If you're willing and able to use Python, this program might be just what you're looking for: check out -readit- by William Buchanan and Brian Quistorff.



              • #8
                Yet another flexible alternative: read the csv into a database (SQLite, for example), then use JDBC or ODBC to connect and pull in any subset of the data.
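
                In case it is useful, here is a minimal sketch of the final Stata step, assuming the csv has already been loaded into a database table reachable through an ODBC data source; the DSN name, table name, and WHERE clause below are placeholders rather than anything from this thread:
                Code:
                * pull only the wanted subset straight from the database (names are placeholders)
                odbc load, exec("SELECT * FROM bigtable WHERE keepflag = 1") dsn("mydb") clear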

