Running a Regression Faster

Paul Smith

Join Date: Jul 2019

Posts: 16
#1

11 Nov 2019, 17:19

Hello,

I'm running a conditional logistic regression with over 400 variables and 750,000 rows of data. At present it takes 7-8 minutes to run, but because of the nature of the financial trading I'm using the model for, I need to run it 30 to 40 times a day, as I add real-time market information to one of the variables.

I've tried a two-step process, but I'm losing predictive accuracy.

Is there any way to run a regression of this scale quickly (45-60 seconds)?

Thanks,
Paul
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2402
#2

11 Nov 2019, 18:05

Hi Paul,

That runtime doesn't necessarily seem outrageous given the model and the dimensions of your data. Unfortunately, without more detail it's hard to give specific advice. Here are some general things to consider (subject, of course, to information you have not given us about your problem):
- 30-40 runs / day at 8 minutes / run is achievable in terms of time (total expected running time is ~4-5 hours). If this is not practical because of data manipulation steps, you could try automating as much as possible and running your program as a scheduled batch job.
- It's possible that intermediate results for your dataset are being cached to disk, so monitoring RAM usage during a run could identify one possible bottleneck.
- Using Stata/MP (or a Stata/MP license for more cores) allows multicore processing, which generally increases computation speed. Ditto for a faster processor.
- If you are using -margins- after the regression to get point estimates, you can try the -nose- option to forgo the standard error calculations, which can be quite slow.
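
As a rough illustration of the last point, here is a minimal sketch. The variable names (y, x1, x2, id) are placeholders, not Paul's actual model, and -predict(pu0)- is chosen only as an example statistic:

```stata
* Hypothetical names: y (outcome), x1 x2 (predictors), id (group identifier)
clogit y x1 x2, group(id)

* Point estimates only: -nose- skips the standard-error calculations
margins, predict(pu0) nose
```

Whether this saves much time depends on how much of the 7-8 minutes is spent in -margins- rather than in the fit itself.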
Joseph Coveney

Join Date: Apr 2014

Posts: 4410
#3

11 Nov 2019, 19:21

Couple of other things to consider, although they probably won't add as much as Leonardo's suggestions.

1. If the data are just updates from 10 to 12 minutes ago, then maybe you can use the last fitted estimates as starting values for the next fit, which might reduce the number of iterations to convergence.

2. A long shot, but if all 400 predictors are categorical, then you might gain a little by using -contract- and then -fweight- with -clogit- (-xtlogit, fe- doesn't allow -fweight-). Note that weights are expected to be constant within panels.
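
A minimal sketch of the first suggestion, assuming placeholder variable names (y, x1, x2, id) and a matrix name (b_prev) that are not from the thread. It relies on the -from()- maximization option to pass the previous coefficient vector as starting values:

```stata
* Hypothetical names: y, x1 x2, id; the real model has over 400 predictors
clogit y x1 x2, group(id)
matrix b_prev = e(b)          // keep the fitted coefficients

* ... refresh the live market variable here ...

* Refit, starting the optimizer from the previous estimates
clogit y x1 x2, group(id) from(b_prev)
```

If the updated data are close to the previous data, the second fit may converge in fewer iterations, though the time per iteration is unchanged.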
Paul Smith

Join Date: Jul 2019

Posts: 16
#4

18 Nov 2019, 05:40

Many thanks for the replies and apologies for the delay in getting back to this post.

I've decided to split the data in two: one dataset that I can estimate in the morning, whose -predict- output I then join with another live dataset from the market in the afternoon. That reduces my compute time to about 45 seconds.
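
The thread doesn't show the code for this, so the following is only a sketch of what such a split might look like. The dataset names (morning_data, live_data, morning_pred), the variables (y, x1, x2, id, obs_id), and the choice of the linear prediction are all assumptions:

```stata
* Morning: fit on the slow-moving data and store the linear prediction
use morning_data, clear                 // hypothetical dataset
clogit y x1 x2, group(id)
predict xb_morning, xb
keep id obs_id xb_morning
save morning_pred, replace

* Afternoon: attach the stored predictions to the live market data
use live_data, clear                    // hypothetical dataset
merge 1:1 id obs_id using morning_pred, keep(match) nogenerate
```

How the stored prediction is then combined with the live market variable isn't described in the post, so that step is left out here.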