Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Running time of program

    Hi,

    I'm using rolling in Stata 14.1 on a Windows 7 machine with 32GB of RAM. I have an unbalanced panel data set of mutual funds. There are 1,037,398 rows and 15 columns with various characteristics of the funds in the data set, including returns. I want to calculate the rolling average, standard deviation, skew, and kurtosis of returns by mutual fund for the past 12-, 24-, and 36-months.

    The program I wrote works as expected and finishes in a more or less reasonable amount of time for subsamples of the data. For example, for the 12-month horizon it takes 25 second to run on a subsample of 10,000 observations and 23 minutes on a subsample of 100,000 observations. But the program ran for over 15 hours without completing when I tried it on the full sample of 12-month returns. I've tried a few things based on discussions in this forum, such as using the command asrol that was written by another user. Doing so on the subsamples of my data set causes Stata to freeze. I also understand that the set memory option has been deprecated in recent versions of Stata. (I've used earlier versions of Stata in my research several years ago.)

    I'm happy to post a sample data set, but I'm not sure it would be helpful because I know the program works on the subsamples. The problem seems to be related to how memory is allocated in Stata. The code fragment below is the one that calculates the moments of returns. I declare the data set to be a panel, drop the irrelevant variables, and then use rolling in the standard way to compute the statistics. I don't see why it takes so long on the full data set.

    Thank you.

    Ron

    so fundid date_ym
    tsset fundid date_ym

    * net or gross returns
    gen ret = ret_net
    *gen ret = ret_gross

    *keep fund_id ret date ticker
    keep fundid ret date_ym fund_id
    drop if ret>=.
    save temp, replace

    timer clear 1
    timer on 1

    rolling avg12=r(mean) sd12=r(sd) skew12=r(skewness) kurt12=r(kurtosis), window(12) clear nodots: summarize ret, detail
    label var avg12 "Average 12 month return (ending current month)"
    label var sd12 "St.dev. of 12 month return (ending current month)"
    label var skew12 "Skewness of 12 month return (ending current month)"
    label var kurt12 "Kurtosis of 12 month return (ending current month)"
    gen date = end
    format date %tm
    save rolling_output12, replace

    timer off 1
    timer list 1

  • #2
    The core of the issue can be experienced with the following dummy code (using timeit from ssc):
    Code:
    clear all
    set obs 1000
    gen x = runiform()
    timeit 1: sum x
    set obs 100000000
    timeit 2: sum x if _n < 1000
    timeit 3: sum x in 1/1000
    timer list
    
    clear
    mata
        timer_clear()
        X = runiform(1000,1)
        
        timer_on(1)
        mean(X)
        timer_off(1)
        
        Xm = J(100000000-1000, 1, .)
        X2 = X\Xm
        
        timer_on(2)
        mean(X[1..1000])
        timer_off(2)
        
        timer()
    end
    Which on my pc led to following timings

    Code:
       1:      0.00 /        1 =       0.0010
       2:      2.37 /        1 =       2.3710
       3:      0.00 /        1 =       0.0010
      4.       .002 /        1 =      .002
      5.          0 /        1 =         0
    Note that in all cases, all that was asked if to perform the sum command on 1000 observations, the only difference is in how much work Stata had to do to find those 1000. The clear conclusion is that Stata is very fast at finding those... unless it's done through an if statement.

    The downside of the rolling command is that the only way it can be as flexible as it is, is by using an if statement, in particular, if inrange(<timevar>, <start>, <end>). It has to do that in every iteration and we just saw that these iterations take longer the larger your dataset. So not only do you get many more iterations, they also each take longer. You can imagine how quickly this gets impossible.

    In your case, there might be a "hacky" solution, inspired by a post from Nick Cox from ages past. I don't know if this is the best solution, but it should work.

    Code:
    clear
    set obs 1000
    egen t = seq(), to(20)
    egen n = seq(), block(20)
    gen return = runiform()
    
    xtset n t
    
    forvalues i = 0/11 {
        gen return_L`i' = L`i'.return
    }
    
    egen nonMissing = rownonmiss(return_L*)
    egen meanReturn = rowtotal(return_L*)
    
    replace meanReturn = meanReturn/nonMissing
    What I've done here is just manually create the windows and then using the formula for the mean to calculate it. You can also use the rowmean() function of egen of course. I don't remember how skewness and kurtosis are calculated, but IIRC their formulas are not complicated. The advantage of this method is that in larger datasets the number of calculations increases, but the time it takes to do each calculation remains the same.

    Comment


    • #3
      Just use rangestat (from SSC). It can generate the desired results on the full sample in less than a minute. Here's a quick example:

      Code:
      clear all
      set obs 3000
      gen long fundid = _n
      expand 360
      bysort fundid: gen date_ym = _n
      gen ret = runiform()
      
      rangestat (count) nobs=ret (mean) avg12=ret (sd) sd12=ret (skewness) skew12=ret ///
          (kurtosis) kurt12=ret, interval(date_ym -35 0) by(fundid)

      Comment


      • #4
        Thanks for the reply and proposed solution. I may try it. For now, I'm going to do the calculations in Python and then export them to Stata to do the data analysis. Python should run significantly faster.

        Comment


        • #5
          Ron, did you miss my point that you can perform these calculations on the full sample in less than 1 minute? Here's a revised example that performs all 3 rolling windows (12, 24, and 36 month) on a test dataset with 1,080,000 observations:

          Code:
          clear all
          set obs 3000
          gen long fundid = _n
          expand 360
          bysort fundid: gen date_ym = _n
          gen ret = runiform()
          
          timer on 1
          rangestat (count) nobs12=ret (mean) avg12=ret (sd) sd12=ret (skewness) skew12=ret ///
              (kurtosis) kurt12=ret, interval(date_ym -11 0) by(fundid)
          rangestat (count) nobs24=ret (mean) avg24=ret (sd) sd24=ret (skewness) skew24=ret ///
              (kurtosis) kurt24=ret, interval(date_ym -23 0) by(fundid)
          rangestat (count) nobs36=ret (mean) avg36=ret (sd) sd36=ret (skewness) skew36=ret ///
              (kurtosis) kurt36=ret, interval(date_ym -35 0) by(fundid)
          timer off 1
          timer list
          And here is the timer result:
          Code:
          . timer list
             1:     89.79 /        1 =      89.7870
          All done!

          Comment


          • #6
            Originally posted by Robert Picard View Post
            [...]
            I knew there would be an easier solution...

            Comment


            • #7
              Hi Robert,

              Indeed I did. My earlier post was directed at Jesse. I saw your post after I responded to him.

              Your solution worked well. The running time on the full data set for all three horizons was 69 seconds. And it didn't require me to go between Stata and Python, which would've been inconvenient. Thank you very much for your help.

              Ron

              Comment


              • #8
                Ron, I appreciate you providing this valuable information. I have a very silly question. I followed your codes and got extra variables such as nobs12, avg12, sd12, skew12, and kurt12, and so on. If I need summary stats from those new variables, could I use just Summarize * simply? I am currently dealing with some panel data and do want to get summary stats firm by firm.

                Comment

                Working...
                X