  • Using Stata for very large data sets

    I am currently trying to use Stata with a very large data set (millions of cases, 10+ GB) and it takes forever even to open the data. It took me over 20 minutes just to run a simple frequency on one variable. I've changed all the memory settings in Stata, so no problem there.
    Unfortunately, I've had to switch over to SPSS because it seems to handle larger data sets a little better and runs analyses quicker than Stata. However, I really would prefer to use Stata, so I was wondering if anyone here has worked with very large data sets and has some suggestions?

  • #2
    I frequently have data sets of 20-25 GB; the main things I have learned:

    1. use -if- as little as possible, as it is very slow; in fact, for one project where I needed to estimate 16 regressions on each subset of the data, it was much faster to use -if- only once, to keep only the subset immediately wanted (see the sketch after this list)

    2. for many data management tasks, it is much faster to break the data into pieces, do the task on each piece, and then append the pieces (-reshape- is an example)

    3. I have found that the second (etc.) time reading a large data set is much faster than the first time, which is what makes the strategy in point 1 above reasonable
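
    A minimal sketch of point 1, with hypothetical variable and file names (y, x1-x3, subgroup, mydata.dta): subset once with -keep if- rather than re-evaluating the -if- qualifier on every command.

    Code:
    * Slow: the -if- qualifier is re-evaluated over all observations for every command
    use mydata, clear
    regress y x1 if subgroup == 1
    regress y x2 if subgroup == 1
    regress y x3 if subgroup == 1

    * Faster: keep the subset once, then run the commands on the smaller data in memory
    use mydata, clear
    keep if subgroup == 1
    regress y x1
    regress y x2
    regress y x3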



    • #3
      3. I have found that the second (etc.) time reading a large data set is much faster than the first time - which is what makes the strategy in point 1 above reasonable
      I'm not disputing your observation, but I can't understand how that's possible. Unless you somehow make changes to the dataset and save it, I don't see how reading the same data the second time can be faster. Does Stata remember some kind of meta-data about datasets we use? If so, how long does it retain it: if you use a data set and then don't use it again for a long time, does the speedup not occur? I just don't get how this could work.

      (Posed purely out of curiosity. I've never worked with data sets that large.)



      • #4
        Me neither <grin>, but I have noticed it on multiple data sets; I use Mac OS X (current updated version). Maybe someone else has an idea?



        • #5
          At least with Windows if your machine has enough available volatile memory (RAM), then the operating system will usually cache the file in it, in anticipation of having to use the file again soon. For very large datasets, overall speed of many operations is often limited by I/O, that is, reading from the disc (or over the wire if the file resides on a server). Accessing a cached file is much faster. Once another application (or file) calls for the memory, however, the operating system will release it, dropping the cached file, and so any read of the file taking place after that will have to come from the disc (over the wire) again. I assume that Macintosh operating systems behave similarly.



          • #6
            Joseph Coveney is mostly right, though actually there are multiple caches it may use (pure RAM, on-board cache, and disk-caching).

            In any case, how much RAM do you have on your computer? If you go over the amount of RAM you have available, it will slow down dramatically. Obviously, boosting RAM is one approach, and nowadays, not all that expensive. What kind of hard drive? If you max out on RAM and are stuck using the drive, getting an SSD hard drive can help a lot. Is it a laptop or a desktop? If it's a desktop, I recommend having an SSD for boot and working files, and keeping a traditional spinning platter drive for bulk storage. Oh, and Joe makes a very good point in passing -- does the file reside on a server or the local system? Copying it off the server to the local drive can make a world of difference.

            Finally, do you *need* all your cases all the time? With that many cases, you could run most commands on a random subset and be confident your results will be close to the results with the full dataset.
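
            A minimal sketch of the random-subset idea for exploratory work, assuming a hypothetical file mydata.dta and an arbitrary 10% sampling fraction:

            Code:
            * Draw a reproducible 10% random sample before running exploratory analyses
            use mydata, clear
            set seed 12345
            sample 10          // keeps a random 10% of observations, drops the rest
            summarize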



            • #7
              It's the same on a Mac. Modern operating systems do I/O caching. This is why you don't hear much talk about RAM disks anymore. With big data, a fast SSD is also crucial. Here's a quick example of the effect on my computer.

              Code:
              . set rmsg on
              r; t=0.00 19:14:56
              
              . use "/Users/robert/Documents/projects/raw_data/raw_data_combo.dta"
              (Raw 1995 1998 2001 2004 2006 2008 2009 2011 combo)
              r; t=3.62 19:15:12
              
              . tempfile f
              r; t=0.01 19:15:18
              
              . save "`f'"
              file /var/folders/cp/z8cssshn6935x9p181c71_7m0000gn/T//S_00348.000001 saved
              r; t=4.23 19:15:27
              
              . use "/Users/robert/Documents/projects/raw_data/raw_data_combo.dta"
              (Raw 1995 1998 2001 2004 2006 2008 2009 2011 combo)
              r; t=0.76 19:15:33
              
              . clear
              r; t=0.07 19:15:38
              
              . use "/Users/robert/Documents/projects/raw_data/raw_data_combo.dta"
              (Raw 1995 1998 2001 2004 2006 2008 2009 2011 combo)
              r; t=1.49 19:15:42
              
              . use "`f'"
              (Raw 1995 1998 2001 2004 2006 2008 2009 2011 combo)
              r; t=0.79 19:15:47
              
              . use "/Users/robert/Documents/projects/raw_data/raw_data_combo.dta", clear
              (Raw 1995 1998 2001 2004 2006 2008 2009 2011 combo)
              r; t=0.83 19:15:59
              
              . des, short
              
              Contains data from /Users/robert/Documents/projects/raw_data/raw_data_combo.dta
                obs:     9,294,857                          Raw 1995 1998 2001 2004 2006 2008 2009 2011 combo
               vars:            29                          24 Sep 2014 12:04
               size: 2,797,751,957                          
              Sorted by:  raw_id
              r; t=0.00 19:16:07



              • #8
                If it is taking over 20 minutes to run a simple frequency of one variable, Ben's comment that "if you go over the amount of RAM you have available, [Stata] will slow down dramatically" is almost certainly the explanation. Unlike some other statistical packages, such as SPSS or SAS, Stata loads the entire data set into memory. This has significant speed advantages, provided the data set actually fits in the physical memory available. However, if there's not enough physical memory to hold the data set, in order to be able to load it at all the operating system has to supply virtual memory in the form of a page file on disk, and unless your page file is on a solid state drive Stata will become painfully slow as the operating system swaps portions of the data set into and out of physical memory from disk.
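
                If you are unsure whether this is what is happening, Stata can report how much memory the data require, which you can compare with the physical RAM installed on the machine; a quick check:

                Code:
                describe, short     // reports the size of the dataset in bytes
                memory              // reports the memory Stata has allocated for data and overhead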

                So the number one rule when working with large data sets is to avoid exceeding the amount of RAM available. If you don't need the entire data set for your analysis, you might specify just the variables or observations you need when you read in the data set in the -use- statement. You might be able to run your analysis on a random sample or break the data set into chunks, process each chunk, and combine the results afterwards. Other options would be to run the analysis on a computer with more RAM or to buy more RAM for your computer. Also, run the -compress- command on your data set to make sure that it's not wasting space with needlessly long data types.
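
                For instance, a short sketch of reading only what is needed and then compressing (the file and variable names here are hypothetical):

                Code:
                * Read only the variables and observations required for the analysis
                use id year income if year >= 2010 using "bigfile.dta", clear

                * Shrink storage types where a smaller type holds the values exactly
                compress
                save "analysis_subset.dta", replace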

                Of course, the larger your data set is, the longer Stata commands will take to run, even if the data fit in RAM. Therefore, writing your programs as efficiently as possible with regard to speed becomes a bigger priority. For example, instead of defining a new variable with a -generate- followed by multiple -replace- commands with -if- clauses, you could create that variable with a single -generate- statement using nested cond() functions. In general, commands implemented as ado-files, since they must be interpreted, are a lot slower than built-in commands, so if possible use built-in commands instead of ado-file commands. (You can use the -which- command to determine whether a command is built in or not.) For example, instead of coding

                Code:
                by y: egen int tot = total(x)

                you could code

                Code:
                by y: gen int tot = sum(x)
                by y: replace tot = tot[_N]

                The latter, despite using two commands rather than one, will run a lot faster because -generate- and -replace- are built in but -egen- is implemented as an ado-file. Also, since sorting is time consuming, do as little of it as possible.
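
                Similarly, a small sketch of the nested cond() point above, with a hypothetical numeric variable age (missing values of age fall into group 1 in both versions; handle them explicitly in real code):

                Code:
                * Several passes over the data: -generate- plus repeated -replace ... if-
                generate byte agegrp = 1
                replace agegrp = 2 if age >= 30 & age < 60
                replace agegrp = 3 if age >= 60 & age < .

                * One pass over the data: a single -generate- with nested cond()
                generate byte agegrp2 = cond(age >= 60 & age < ., 3, cond(age >= 30 & age < 60, 2, 1))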
                Last edited by West Addison; 11 Jun 2015, 20:57.



                • #9
                  I've been in this situation a few times recently. There is no universal solution apart from breaking up the data you are given and working in chunks. I have a couple of little homemade utilities written in C++ for this which I will tidy up and put on GitHub sometime before too long. Simpler, but rarely possible, is to ask the database admin to give it to you in a sensible form, not one massive flat file. Often, though, my experience has been that talking to these people is more painful and time consuming than just dealing with it yourself. If you can reduce the chunks to what you need and then recombine them and proceed in Stata, cool. If not, you may have to work out the stepping stones in your calculation, like sums of squares for each chunk, and then manipulate them. We know that a random sample will be well behaved, but I know the 'client' usually wants to quote a large N and won't accept that. This problem is not going to go away any time soon! The good news is that it's quite a fun puzzle.
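
                  A rough sketch of the chunk-and-recombine workflow in Stata itself, assuming the extract has already been split into hypothetical files chunk1.dta ... chunk9.dta and that a sum and a count per group are enough to rebuild the statistic you want (here, a mean):

                  Code:
                  * Summarize each chunk, accumulating the per-chunk results in a tempfile
                  tempfile results
                  forvalues i = 1/9 {
                      use "chunk`i'.dta", clear
                      collapse (sum) sum_x = x (count) n_x = x, by(group)
                      if `i' > 1 append using "`results'"
                      save "`results'", replace
                  }

                  * Per-chunk sums and counts add up, so one more -collapse- gives full-data results
                  use "`results'", clear
                  collapse (sum) sum_x n_x, by(group)
                  generate double mean_x = sum_x / n_x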

