Consider running a regression on the first 200 consecutive observations in a dataset of arbitrary size. This takes 0.04 seconds if the host dataset has 1000 observations. If the host dataset contains 100 million observations, the same regression on 200 consecutive observations takes 3.8 seconds - almost 100 times longer.
This is not likely to be a problem if running only one regression. However, if we consider rolling regressions in a large panel dataset, this performance penalty is multiplied thousands or even millions of times. This is not a contrived example - in finance, it is common to run rolling window regressions on individual stocks in large panel datasets in order to estimate various risk sensitivities. To provide a sense of scale, the CRSP dataset of historical US stock returns contains 83 million daily stock return observations - not much less than the 100 million observations in the example above.
The following minimal working example should help to convince the sceptical (16GB of physical memory is recommended for this example).
Code:
version 13
clear
set obs 100000000
gen x1 = rnormal()
gen x2 = rnormal()
gen e = rnormal()
gen y = 3*x1 + 2*x2 + e

* Case 1: regression sample = 100mn, host dataset = 100mn
* Takes 6.65 seconds
timer clear
timer on 1
regress y x1 x2
timer off 1
timer list

* Case 2: regression sample = 200, host dataset = 100mn
* Takes 3.79 seconds
timer clear
timer on 1
regress y x1 x2 in 1/200
timer off 1
timer list

* Case 3: regression sample = 200, host dataset = 1000
* Takes 0.04 seconds
keep in 1/1000
timer clear
timer on 1
regress y x1 x2 in 1/200
timer off 1
timer list
This motivated me to look a bit deeper into this issue. I ended up writing a paper about it (I am an academic, after all). A draft can be found here: http://papers.ssrn.com/abstract=2423171
Briefly, it appears that small regressions in large datasets are slow because of the overhead required to mark the estimation sample in -regress-. This suggests that the same problem is likely to affect other estimation commands in Stata, although I did not check.
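For anyone who wants to gauge that overhead directly, here is a rough back-of-the-envelope check (my own illustration for this post, not a result from the paper): time how long it takes just to build a sample marker with -mark- and -markout-, the building blocks behind -marksample-, in the full 100-million-observation dataset from the example above. The marking work is proportional to the size of the host dataset, not to the 200 observations actually used in the regression.

Code:
* run this right after creating the 100mn-observation dataset above,
* i.e. before the -keep in 1/1000- of case 3; timings are machine dependent
timer clear
timer on 1
tempvar touse
mark `touse' in 1/200
markout `touse' y x1 x2
timer off 1
timer list
drop `touse'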
In the paper I suggest two solutions; however, I would welcome comments, criticism, and advice from the Stata community. My question: is there a better way to solve this problem?
Here are my solutions:
1) Break a large dataset into many small datasets, estimate whatever it is you need, and then recombine the results. This works, but it is not particularly elegant, and the splitting and combining adds processing overhead of its own. (A rough sketch of this approach follows below.)
2) Rewrite the estimation command so as to avoid the penalty involved in creating a new e(sample) variable in the dataset each time the estimation command is called. This is easy to do in Mata, using st_view() to access consecutive observations. This is the approach I use in the paper: I basically write a toy implementation of OLS regression in Mata and package it as a Stata ado file (called -fastreg-, of course). The source code for -fastreg- is available in the appendix to the paper and is open-source licensed under the GNU GPL. (A stripped-down sketch of the idea also follows below.)
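Here is a rough sketch of solution 1. To be clear, this is just an illustration put together for this post, not code from the paper: the file name bigpanel.dta, the identifier permno, the variables ret and mktrf, the block size of roughly 100 stocks per file, and the simple per-stock regression (standing in for whatever rolling estimation is actually needed) are all assumptions.

Code:
use bigpanel, clear

* assign each stock to a block of roughly 100 stocks
egen long block = group(permno)
replace block = ceil(block/100)
save bigpanel_blocked, replace
summarize block, meanonly
local nblocks = r(max)

* estimate within each small per-block dataset, where regressions are cheap
forvalues b = 1/`nblocks' {
    use if block == `b' using bigpanel_blocked, clear
    statsby beta=_b[mktrf], by(permno) clear: regress ret mktrf
    save betas_block`b', replace
}

* recombine the per-block estimates
use betas_block1, clear
forvalues b = 2/`nblocks' {
    append using betas_block`b'
}
save betas_all, replace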
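Here is a stripped-down sketch of the core idea behind solution 2. This is not the actual -fastreg- source from the appendix; the function name olsrange and the layout of the returned coefficient vector are purely illustrative. The point is that the coefficients for a block of consecutive observations are computed entirely in Mata, so nothing is ever written back to the dataset.

Code:
mata:
real colvector olsrange(string scalar yvar, string rowvector xvars, real scalar first, real scalar last)
{
    real colvector y
    real matrix    X

    // views onto observations first..last: no data are copied and,
    // crucially, no e(sample)-style marker variable touches the dataset
    st_view(y, (first::last), yvar)
    st_view(X, (first::last), xvars)

    // (X,1)'(X,1) and (X,1)'y via cross(), which appends the constant term;
    // the intercept is therefore the last element of the returned vector
    return(invsym(cross(X, 1, X, 1)) * cross(X, 1, y, 0))
}
end

* example call: slope and intercept estimates for observations 1-200
* of the data in memory (e.g. the y, x1, x2 created in the example above)
mata: olsrange("y", ("x1", "x2"), 1, 200)

Called in a loop over rolling windows, something along these lines avoids the per-call marking overhead entirely, which is essentially where the speed-up reported in the paper comes from.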
From a practical point of view, this second approach does appear to solve the specific problem of estimating rolling regressions in large panel datasets. In an empirical finance setting, -fastreg- is 367 times faster than -regress- when calculating rolling window regressions.
I should point out that this does not mean -regress- is slow in general. In fact, -regress- is about 4 times faster than -fastreg- in large datasets when the full dataset is used as the estimation sample. The real performance issue occurs when estimating millions of small regressions in very large datasets.
I believe this is a practical issue for many (search "stata rolling slow", and you'll see what I mean). Any thoughts?