Running time of program

Ron Alquist

Join Date: Aug 2017

Posts: 5
#1

Running time of program

25 Aug 2017, 06:59

Hi,

I'm using rolling in Stata 14.1 on a Windows 7 machine with 32GB of RAM. I have an unbalanced panel data set of mutual funds. There are 1,037,398 rows and 15 columns with various characteristics of the funds in the data set, including returns. I want to calculate the rolling average, standard deviation, skew, and kurtosis of returns by mutual fund for the past 12-, 24-, and 36-months.

The program I wrote works as expected and finishes in a more or less reasonable amount of time for subsamples of the data. For example, for the 12-month horizon it takes 25 second to run on a subsample of 10,000 observations and 23 minutes on a subsample of 100,000 observations. But the program ran for over 15 hours without completing when I tried it on the full sample of 12-month returns. I've tried a few things based on discussions in this forum, such as using the command asrol that was written by another user. Doing so on the subsamples of my data set causes Stata to freeze. I also understand that the set memory option has been deprecated in recent versions of Stata. (I've used earlier versions of Stata in my research several years ago.)

I'm happy to post a sample data set, but I'm not sure it would be helpful because I know the program works on the subsamples. The problem seems to be related to how memory is allocated in Stata. The code fragment below is the one that calculates the moments of returns. I declare the data set to be a panel, drop the irrelevant variables, and then use rolling in the standard way to compute the statistics. I don't see why it takes so long on the full data set.

Thank you.

Ron

so fundid date_ym
tsset fundid date_ym

* net or gross returns
gen ret = ret_net
*gen ret = ret_gross

*keep fund_id ret date ticker
keep fundid ret date_ym fund_id
drop if ret>=.
save temp, replace

timer clear 1
timer on 1

rolling avg12=r(mean) sd12=r(sd) skew12=r(skewness) kurt12=r(kurtosis), window(12) clear nodots: summarize ret, detail
label var avg12 "Average 12 month return (ending current month)"
label var sd12 "St.dev. of 12 month return (ending current month)"
label var skew12 "Skewness of 12 month return (ending current month)"
label var kurt12 "Kurtosis of 12 month return (ending current month)"
gen date = end
format date %tm
save rolling_output12, replace

timer off 1
timer list 1
Tags: None
Jesse Wursten

Join Date: Jan 2016

Posts: 915
#2

25 Aug 2017, 07:54

The core of the issue can be experienced with the following dummy code (using timeit from ssc):

Code:

clear all set obs 1000 gen x = runiform() timeit 1: sum x set obs 100000000 timeit 2: sum x if _n < 1000 timeit 3: sum x in 1/1000 timer list clear mata timer_clear() X = runiform(1000,1) timer_on(1) mean(X) timer_off(1) Xm = J(100000000-1000, 1, .) X2 = X\Xm timer_on(2) mean(X[1..1000]) timer_off(2) timer() end

Which on my pc led to following timings

Code:

1: 0.00 / 1 = 0.0010 2: 2.37 / 1 = 2.3710 3: 0.00 / 1 = 0.0010 4. .002 / 1 = .002 5. 0 / 1 = 0

Note that in all cases, all that was asked if to perform the sum command on 1000 observations, the only difference is in how much work Stata had to do to find those 1000. The clear conclusion is that Stata is very fast at finding those... unless it's done through an if statement.

The downside of the rolling command is that the only way it can be as flexible as it is, is by using an if statement, in particular, if inrange(<timevar>, <start>, <end>). It has to do that in every iteration and we just saw that these iterations take longer the larger your dataset. So not only do you get many more iterations, they also each take longer. You can imagine how quickly this gets impossible.

In your case, there might be a "hacky" solution, inspired by a post from Nick Cox from ages past. I don't know if this is the best solution, but it should work.

Code:

clear set obs 1000 egen t = seq(), to(20) egen n = seq(), block(20) gen return = runiform() xtset n t forvalues i = 0/11 { gen return_L`i' = L`i'.return } egen nonMissing = rownonmiss(return_L*) egen meanReturn = rowtotal(return_L*) replace meanReturn = meanReturn/nonMissing

What I've done here is just manually create the windows and then using the formula for the mean to calculate it. You can also use the rowmean() function of egen of course. I don't remember how skewness and kurtosis are calculated, but IIRC their formulas are not complicated. The advantage of this method is that in larger datasets the number of calculations increases, but the time it takes to do each calculation remains the same.
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

25 Aug 2017, 08:02

Just use rangestat (from SSC). It can generate the desired results on the full sample in less than a minute. Here's a quick example:

Code:

clear all
set obs 3000
gen long fundid = _n
expand 360
bysort fundid: gen date_ym = _n
gen ret = runiform()

rangestat (count) nobs=ret (mean) avg12=ret (sd) sd12=ret (skewness) skew12=ret ///
    (kurtosis) kurt12=ret, interval(date_ym -35 0) by(fundid)

Comment

Ron Alquist

Join Date: Aug 2017

Posts: 5
#4

25 Aug 2017, 08:13

Thanks for the reply and proposed solution. I may try it. For now, I'm going to do the calculations in Python and then export them to Stata to do the data analysis. Python should run significantly faster.
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

25 Aug 2017, 08:32

Ron, did you miss my point that you can perform these calculations on the full sample in less than 1 minute? Here's a revised example that performs all 3 rolling windows (12, 24, and 36 month) on a test dataset with 1,080,000 observations:

Code:

clear all
set obs 3000
gen long fundid = _n
expand 360
bysort fundid: gen date_ym = _n
gen ret = runiform()

timer on 1
rangestat (count) nobs12=ret (mean) avg12=ret (sd) sd12=ret (skewness) skew12=ret ///
    (kurtosis) kurt12=ret, interval(date_ym -11 0) by(fundid)
rangestat (count) nobs24=ret (mean) avg24=ret (sd) sd24=ret (skewness) skew24=ret ///
    (kurtosis) kurt24=ret, interval(date_ym -23 0) by(fundid)
rangestat (count) nobs36=ret (mean) avg36=ret (sd) sd36=ret (skewness) skew36=ret ///
    (kurtosis) kurt36=ret, interval(date_ym -35 0) by(fundid)
timer off 1
timer list

And here is the timer result:

Code:

. timer list
   1:     89.79 /        1 =      89.7870

All done!

Comment

Jesse Wursten

Join Date: Jan 2016

Posts: 915
#6

25 Aug 2017, 08:39

Originally posted by Robert Picard View Post

[...]

I knew there would be an easier solution...
Comment
Ron Alquist

Join Date: Aug 2017

Posts: 5
#7

25 Aug 2017, 10:19

Hi Robert,

Indeed I did. My earlier post was directed at Jesse. I saw your post after I responded to him.

Your solution worked well. The running time on the full data set for all three horizons was 69 seconds. And it didn't require me to go between Stata and Python, which would've been inconvenient. Thank you very much for your help.

Ron
1 like
Comment
Patrick Shin

Join Date: Mar 2018

Posts: 12
#8

21 Jul 2019, 10:01

Ron, I appreciate you providing this valuable information. I have a very silly question. I followed your codes and got extra variables such as nobs12, avg12, sd12, skew12, and kurt12, and so on. If I need summary stats from those new variables, could I use just Summarize * simply? I am currently dealing with some panel data and do want to get summary stats firm by firm.
Comment

Announcement

Running time of program

Comment

Comment

Comment

Comment

Comment

Comment

Comment