Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Fastest way to calculate the column sums by panels in Mata

    What is the fastest way to calculate the column sums by panels (IDs) in Mata? I use this in a panel maximum likelihood estimation algorithm, and I currently do it as shown below. While it does the job, it is slow especially since the data is huge and the calculation loop runs over and over again for finding the maximum likelihood. Is there a faster way to do this?

    This is a simplified single-run example of the algorithm:
    Code:
    clear
    set obs 1000000
    gen id=_n
    expand 10
    sort id
    gen var1 = runiform(0,1)
    gen var2 = runiform(0,1)
    gen var3 = runiform(0,1)
    
    timer clear 1
    timer on 1
    
    mata {
        id = st_data(.,"id")
        big = (st_data(.,"var1"), st_data(.,"var2"), st_data(.,"var3"))
        V = panelsetup(id, 1)
        
        bigtot = J(rows(id),3,.)
        
        // this takes too much time -->
        for (i=1; i<=rows(V); i++) {
            X1 = panelsubmatrix(big, i, V)
            bigtot[V[i,1]::V[i,2],.]=J(rows(X1),1,colsum(X1))
        } //<--
    }
    
    timer off 1
    timer list 1

  • #2
    Having done a code in a similar spirit before I think the biggest bottle neck on your code is loading the data very single time.
    perhaps you can load the data first and then do the subpanel process
    alternatively perhaps st_view may be more efficient in updating the data (assuming the data changes every time)
    hth
    fernando

    Comment


    • #3
      I am thinking that the reason why it takes Mata too long to complete this job is that the for loop inside the Mata code is not parallelized. I use an MP version of Stata, and a function such as "colsum() by panel" should be parallelize-able, but when it is implemented within a for loop, the code just does it step-by-step. What do you think?

      Comment


      • #4
        What takes time is maintaining/updating the matrix bigtot (16 out off almost 20 sec.), not colsum.
        Actually the colsum part is the fastest.
        A way of speeding things up could be to do the necessary calculations within the loop without saving in bigtot.

        Code:
        . /*
        > clear
        > set obs 1000000
        > gen id=_n
        > expand 10
        > sort id
        > gen var1 = runiform(0,1)
        > gen var2 = runiform(0,1)
        > gen var3 = runiform(0,1)
        > */
        . mata mata clear
        . mata:
        ------------------------------------------------- mata (type end to exit) ------------------------------------------------------------------------------------------------------------
        :     id = st_data(.,"id")
        :     big = (st_data(.,"var1"), st_data(.,"var2"), st_data(.,"var3"))
        :     V = panelsetup(id, 1)
        :     
        :     bigtot = J(rows(id),3,.)
        :     
        :     timer_clear()
        :     timer_on(1)
        :     // this takes too much time -->
        :     R = rows(V)
        :     for (i=1; i<=R; i++) {
        >         timer_on(11)
        >         X1 = panelsubmatrix(big, i, V)
        >         timer_off(11)
        >         timer_on(12)
        >         tmp=J(rows(X1),1,colsum(X1))
        >         timer_off(12)
        >         timer_on(13)
        >         bigtot[V[i,1]::V[i,2],.]=tmp
        >         timer_off(13)
        >     } //<--
        :     timer_off(1)
        :     timer()
        --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
        timer report
          1.       19.8 /        1 =    19.753
         11.       1.75 /  1000000 =  1.75e-06
         12.       1.04 /  1000000 =  1.04e-06
         13.         16 /  1000000 =   .000016
        --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
        : end
        kind regards
        nhb
        Kind regards

        nhb

        Comment


        • #5
          Niels, you are right about that it is the bigtot matrix within the Mata for loop that takes a lot. But what Cyrus is asking for requires that matrix to be constructed somehow. I think what Cyrus wants is a Mata equivalent of "bys id: egen double bigtot=total(var1)". The problem is that in Stata, "bys id: egen ....=total( )" is parallelized, so if you use Stata MP, the process would be very fast. As far as I know, Mata does not have an equivalent of "bys id: egen ....=total( )" so the process cannot be parallelized in Mata. Technically, it should be possible to get the same outcome purely in Mata, but as far as I know (and I'll double-check about this with the Stata tech service), Mata does not directly provide functionality for parallelized panels or parallel for loops in Mata. Not yet. But below are some workarounds.

          For simplicity, I dropped var2 and var3. When I run the code below on my computer with 6 processors Stata MP, it takes Stata's egen 3 second to generate bigtot. It takes Cyrus' Mata loop 34 seconds to generate bigtot. Using -parallel- with Cyrus' Mata loop decreases that time to 20 seconds. Calling Stata's egen from within Mata takes only 5.2 seconds. I think the last workaround is good enough especially if Cyrus wants to keep everything inside Mata.

          Code:
          . timer list
             1:      3.07 /        1 =       3.0670
             2:     34.07 /        1 =      34.0710
             3:     19.87 /        1 =      19.8710
             4:      5.20 /        1 =       5.2000
          Code:
          clear
          set obs 1000000
          gen id=_n
          expand 20
          sort id
          gen var1 = runiform(0,1)
          
          cap program drop coltot
          program define coltot
          {
              mata {
                  id = st_data(.,"id")
                  big = st_data(.,"var1")
                  V = panelsetup(id, 1)
                  
                  bigtot = J(rows(id),1,.)
                  
                  for (i=1; i<=rows(V); i++) {
                      X1 = panelsubmatrix(big, i, V)
                      bigtot[V[i,1]::V[i,2],.]=J(rows(X1),1,colsum(X1))    
                  }
                  
                  st_addvar("double", "bigtot")
                  st_store(., "bigtot", bigtot)
              }
          }
          end
          
          cap program drop coltot2
          program define coltot2
          {
              mata {
                  id = st_data(.,"id")
                  big = st_data(.,"var1")
                  
                  (void) st_addvar("double", bigcopy=st_tempname())
                  st_store(., bigcopy, big)
                  stata("bys id: egen double bigtot4=total(" + bigcopy + ")")
              }
          }
          end
          
          timer clear
          cap drop big*
          
          timer on 1
          bys id: egen double bigtot1=total(var1)
          timer off 1
          
          timer on 2
          coltot
          timer off 2
          ren bigtot bigtot2
          
          timer on 3
          parallel setclusters 6 //number of processors you have
          parallel clean, all force
          sort id
          parallel, by(id) prog(coltot): coltot
          timer off 3
          ren bigtot bigtot3
          
          timer on 4
          coltot2
          timer off 4
          
          count if bigtot1!=bigtot2
          count if bigtot2!=bigtot3
          count if bigtot3!=bigtot4
          
          cap timer clear 97
          cap timer clear 98
          cap timer clear 99
          timer list

          Comment


          • #6
            Mustafa Ugur Karakaplan: This is quite interesting. Thank you very much for presenting this.

            It is not that I would start a long discussion on this.
            But, in my experience a matrix of that size is usually a mean to an end, not the goal itself.
            If the handling later requires such a matrix, my first comment would be that the overall design might be bad.
            Looking at the whole process, stepwise calculations and later aggregating might be a faster solution.

            Parallel computing should never be the solution to bad coding! (Not that this necessarily is the case here)
            Kind regards

            nhb

            Comment


            • #7
              This is an interesting topic, thank you very much for this.

              I am a bit surprised that the mata code takes so long, but I can confirm that egen is faster even on Stata 16 SE. I am surprised because my experience so far is that well written Mata code runs much faster than Stata code. Moving from Stata to Mata improved the speed of some of my programs by a factor 3 or 4, especially with large data.

              In addition, the parallel solution is great, but usually there are some costs of invoking it. Thus for smaller tasks parallel computing becomes almost too expensive.

              Comment


              • #8
                I think the difference is made by the use of range subscripts. My code is faster than egen, though marginally

                Code:
                clear
                set obs 500000
                gen id=_n
                expand 20
                sort id
                gen var1 = runiform(0,1)
                
                cap program drop coltot
                program define coltot
                {
                    mata {
                        id = st_data(.,"id")
                        big = st_data(.,"var1")
                        panel = panelsetup(id, 1)
                        
                        bigtot = J(rows(id),1,.)
                        rows = rows(panel)
                        
                        for (i=1; i<=rows; i++) {
                            X1 = big[|panel[i,1],1 \ panel[i,2],1|]
                            bigtot[|panel[i,1],1 \ panel[i,2],1|]=J(rows(X1),1,colsum(X1))    
                        }
                        
                        st_addvar("double", "bigtot")
                        st_store(., "bigtot", bigtot)
                    }
                }
                end
                
                
                timer clear
                cap drop big*
                
                timer on 1
                bys id: egen double bigtot1=total(var1)
                timer off 1
                
                timer on 2
                coltot
                timer off 2
                
                
                timer list
                
                   1:      5.29 /        1 =       5.2870
                   2:      4.50 /        1 =       4.5010
                Last edited by Attaullah Shah; 01 Apr 2020, 12:12.
                Regards
                --------------------------------------------------
                Attaullah Shah, PhD.
                Professor of Finance, Institute of Management Sciences Peshawar, Pakistan
                FinTechProfessor.com
                https://asdocx.com
                Check out my asdoc program, which sends outputs to MS Word.
                For more flexibility, consider using asdocx which can send Stata outputs to MS Word, Excel, LaTeX, or HTML.

                Comment


                • #9
                  Sorry for repeating myself, this solution is super interesting.

                  Is there a reason why the range subscripts are so much faster?

                  Comment


                  • #10
                    This is indeed interesting.
                    Looking at the function panelsubmatrix, it is clear that it is based on code similar to Attaullah Shah 's
                    Code:
                    *! version 1.0.1  01jun2013
                    version 9.0
                    mata:
                    
                    /*      
                            X = panelsubmatrix(V, i, info)
                                    return matrix containing i-th panel.
                    */
                    
                    
                    matrix panelsubmatrix(matrix V, real scalar i, real matrix info)
                    {
                            return(V[|info[i,1], 1 \ info[i,2], .|])
                    }
                            
                    end
                    And that part was not slow in my previous code.

                    I did a bit more testing:
                    Code:
                    cls
                    /*
                    clear
                    set obs 1000000
                    gen id=_n
                    expand 10
                    sort id
                    gen var1 = runiform(0,1)
                    gen var2 = runiform(0,1)
                    gen var3 = runiform(0,1)
                    */
                    
                    timer clear
                    
                    timer on 1
                    mata mata clear
                    mata:
                        id = st_data(.,"id")
                        big = st_data(.,"var?")
                        panel = panelsetup(id, 1)
                        
                        bigtot = J(rows(id),3,.)
                        rows = rows(panel)
                        for (i=1; i<=rows; i++) {
                            X1 = big[|panel[i,1],1 \ panel[i,2],3|]
                            bigtot[|panel[i,1],1 \ panel[i,2],3|]=J(rows(X1),1,colsum(X1))    
                        }
                    end
                    timer off 1
                    
                    timer on 2
                    mata mata clear
                    mata:
                        id = st_data(.,"id")
                        big = st_data(.,"var?")
                        panel = panelsetup(id, 1)
                        
                        bigtot = J(rows(id),3,.)
                        rows = rows(panel)
                        for (i=1; i<=rows; i++) {
                            //X2 = big[|panel[i,1],1 \ panel[i,2],3|]
                            X2 = panelsubmatrix(big, i, panel)
                            bigtot[|panel[i,1],1 \ panel[i,2],3|]=J(rows(X2),1,colsum(X2))    
                        }
                    end
                    timer off 2
                    
                    timer on 3
                    mata mata clear
                    mata:
                        id = st_data(.,"id")
                        big = st_data(.,"var?")
                        panel = panelsetup(id, 1)
                        
                        bigtot = J(rows(id),3,.)
                        rows = rows(panel)
                        for (i=1; i<=rows; i++) {
                            //X3 = big[|panel[i,1],1 \ panel[i,2],3|]
                            X3 = panelsubmatrix(big, i, panel)
                            //bigtot[|panel[i,1],1 \ panel[i,2],3|]=J(rows(X2),1,colsum(X2))
                            bigtot[panel[i,1]..panel[i,2],.] =J(rows(X3),1,colsum(X3))
                        }
                    end
                    timer off 3
                    timer list
                    And the times I got was:
                    Code:
                    . timer list
                       1:      7.77 /        1 =       7.7730
                       2:      7.83 /        1 =       7.8290
                       3:     47.12 /        1 =      47.1230
                    So JanDitzen is right, the range subscript, but mainly at the left side of =.
                    This is also described in https://www.stata-journal.com/sjpdf....iclenum=pr0028 , section 4 by W. Gould himself.

                    I forgot, so it is very nice, that someone remembers.

                    Still, I ran Attaullah Shah's code as well and got:
                    Code:
                    . timer list
                       1:      3.01 /        1 =       3.0080
                       2:      6.04 /        1 =       6.0420
                    So, on Stata MP version 16.1 Stata is still twice as fast as Mata
                    Last edited by Niels Henrik Bruun; 01 Apr 2020, 15:52.
                    Kind regards

                    nhb

                    Comment


                    • #11
                      I have been thinking about this topic. I emailed Ben Jann about it and I also asked Stata Tech Service about the issue. Ben emailed me some options and I edited them further. I also heard back from StataCorp. They mentioned the existence of an undocumented Mata function called panelsum. Below are several options to calculate column sums by panels. I also tried using them in a maximum likelihood estimation algorithm with big data -just like how Cyrus explained in the original post. Based on my tests, the Mata routine with the undocumented panelsum function is the fastest of all. The amount of time that it saves is substantial especially since the ML estimation calls it repeatedly who knows how many times and all the data to be evaluated change in every call. Note that the panelsum routine does not have a step-by-step for-loop here, but I haven't checked if there is one behind the scenes. Also, as Ben noted, looking at the timer list, the gen routine is actually the fastest (faster than egen and gegen) but since the routine requires going back and forth between Stata and Mata, that slows down the ML estimation algorithm written in Mata. All in all, I think the best solution to Cyrus' question is the panelsum routine here -if you guys cannot come up with a better one .

                      Code:
                      . timer list
                         1:     34.41 /        1 =      34.4140
                         2:     14.37 /        1 =      14.3680
                         3:     12.97 /        1 =      12.9680
                         4:     14.96 /        1 =      14.9650
                         5:      8.20 /        1 =       8.2020
                         6:      1.03 /        1 =       1.0270
                         7:      3.32 /        1 =       3.3200
                         8:      3.16 /        1 =       3.1640
                      .
                      . // 1: sorting
                      . // 2: coltot2 using panelsetup() for loop
                      . // 3: coltot3 using panelsum()
                      . // 4: coltot4 using mm_panels() for loop
                      . // 5: egen
                      . // 6: gen
                      . // 7: gegen, by
                      . // 8: by: gegen
                      Code:
                      clear all
                      
                      // programs
                      mata:
                          void coltot2() {
                              id = st_data(.,"id")
                              big = st_data(.,"var1")
                              V = panelsetup(id, 1)
                              bigtot = J(rows(id),1,.)
                              for (i=rows(V); i; i--) {
                                  X1 = panelsubmatrix(big, i, V)
                                  bigtot[|V[i,1],.\V[i,2],.|]=J(rows(X1),1,colsum(X1))    
                              }
                              st_addvar("double", "bigtot1")
                              st_store(., "bigtot1", bigtot)
                          }
                      
                          void coltot3() {
                              ID = st_data(.,"ID") //just for ensuring that ID has no gap (1,2,3...)
                              big = st_data(.,"var1")
                              V = panelsetup(ID, 1)
                              Xt = panelsum(big,V) //the undocumented panelsum function
                              bigtot = Xt[ID,]
                              st_addvar("double", "bigtot2")
                              st_store(., "bigtot2", bigtot)
                          }
                      
                          void coltot4() {
                              id = st_data(.,"id")
                              big = st_data(.,"var1")
                              V = _mm_panels(id) //Ben Jann's moremata
                              bigtot = J(rows(id),1,.)
                              b = rows(id)
                              for (i = rows(V); i; i--) {
                                  k = V[i]
                                  a = b - k + 1
                                  X1 = big[|a,.\b,.|]
                                  bigtot[|a,.\b,.|] = J(k, 1, colsum(X1))
                                  b = a - 1
                              }
                              st_addvar("double", "bigtot3")
                              st_store(., "bigtot3", bigtot)
                          }
                      end
                      
                      
                      // generate data
                      set obs 1000000
                      gen id=_n
                      expand 50
                      gegen ID=group(id) //this is for coltot2
                      gen var1 = runiform(0,1)
                      
                      timer clear
                       
                      // sort
                      timer on 1
                      sort id
                      timer off 1
                       
                      // coltot2
                      timer on 2
                      mata: coltot2()
                      timer off 2
                      
                      // coltot3
                      timer on 3
                      mata: coltot3()
                      timer off 3
                      
                      // coltot4
                      timer on 4
                      mata: coltot4()
                      timer off 4
                      
                      // egen
                      timer on 5
                      by id: egen double bigtot4=total(var1)
                      timer off 5
                       
                      // gen
                      timer on 6
                      by id: gen double bigtot5=sum(var1)
                      by id: replace bigtot5=bigtot5[_N]
                      timer off 6
                      
                      // gegen
                      timer on 7
                      gegen double bigtot6=total(var1), by(id) replace
                      timer off 7
                      
                      // gegen
                      timer on 8
                      by id: gegen double bigtot7=total(var1), replace
                      timer off 8
                       
                      // check results
                      assert (bigtot1==bigtot2)
                      assert (bigtot2==bigtot3)
                      assert (bigtot3==bigtot4)
                      assert (bigtot4==bigtot5)
                      assert (bigtot5==bigtot6)
                      assert (bigtot6==bigtot7)
                      
                       
                      // timer
                      timer list
                      
                      // 1: sorting
                      // 2: coltot2 using panelsetup() for loop
                      // 3: coltot3 using panelsum()
                      // 4: coltot4 using mm_panels() for loop
                      // 5: egen
                      // 6: gen
                      // 7: gegen, by
                      // 8: by: gegen

                      Comment


                      • #12
                        Mustafa Ugur Karakaplan: Interesting. Nice work
                        Kind regards

                        nhb

                        Comment


                        • #13
                          I am still fascinated by this topic and how much of a time difference different commands make. I am working on panel time series estimators and thus with a "large" number of observations over time time and units ("id"). I recently recoded one of my commands entirely in mata which made it much faster. During this work I tried to optimize the code so it runs fastest in Monte Carlo simulations as well. I realised that one of the bottle necks is actually often the data generating process too.

                          With this in mind I am interested in which role the number of observations of id and t play and I did the following small experiment. I vary the number of observations over ids and time between: 100, 200, 500, 1000 and 10000. Then I compare the timings of the egen, gen approach and the coltot2 and coltot3 programs.

                          The results for the timings are:
                          Code:
                          egen
                                       1         2         3         4         5         6
                              +-------------------------------------------------------------+
                            1 |      N\T       100       200       500      1000     10000  |
                            2 |      100     0.017     0.028     0.066     0.109     0.957  |
                            3 |      200     0.015     0.036     0.080     0.176     1.984  |
                            4 |      500     0.044     0.080     0.210     0.435     5.203  |
                            5 |     1000     0.080     0.171     0.448     0.943    10.960  |
                            6 |    10000     0.969     2.000     5.262    11.068   125.830  |
                              +-------------------------------------------------------------+
                          gen
                                     1       2       3       4       5       6
                              +-------------------------------------------------+
                            1 |    N\T     100     200     500    1000   10000  |
                            2 |    100   0.002   0.002   0.007   0.013   0.098  |
                            3 |    200   0.002   0.003   0.009   0.019   0.174  |
                            4 |    500   0.004   0.008   0.022   0.039   0.437  |
                            5 |   1000   0.010   0.015   0.038   0.077   0.872  |
                            6 |  10000   0.074   0.152   0.376   0.759   7.500  |
                              +-------------------------------------------------+
                          coltot2
                                      1        2        3        4        5        6
                              +-------------------------------------------------------+
                            1 |     N\T      100      200      500     1000    10000  |
                            2 |     100    0.009    0.007    0.017    0.033    0.319  |
                            3 |     200    0.010    0.013    0.031    0.067    0.623  |
                            4 |     500    0.016    0.032    0.076    0.153    1.553  |
                            5 |    1000    0.032    0.061    0.153    0.300    3.081  |
                            6 |   10000    0.317    0.611    1.508    3.036   38.563  |
                              +-------------------------------------------------------+
                          coltot3
                                      1        2        3        4        5        6
                              +-------------------------------------------------------+
                            1 |     N\T      100      200      500     1000    10000  |
                            2 |     100    0.004    0.007    0.017    0.036    0.305  |
                            3 |     200    0.008    0.012    0.032    0.068    0.595  |
                            4 |     500    0.015    0.031    0.076    0.153    1.562  |
                            5 |    1000    0.031    0.061    0.157    0.300    3.100  |
                            6 |   10000    0.306    0.601    1.479    2.975   44.182  |
                              +-------------------------------------------------------+
                          I think it is interesting - if not fascinating - how much faster the combination of by and gen is, especially in comparison to egen. The mata solution is slower in all cases as well, but still faster than egen. The speed of egen is not referred to in the manual - at least I was not able to find it.

                          This discussion is immensely helpful for me. As pointed out, I am working with simulations and often find Stata a bit on the slow side, especially with "large" data. It would be really helpful to have a guide on "fast" programming for large datasets (any volunteers to help me :p).

                          Appendix:
                          The code is the following (it is a bit ad-hoc....):
                          Code:
                          clear all
                          
                          mata:
                              void coltot2() {
                                  id = st_data(.,"id")
                                  big = st_data(.,"var1")
                                  V = panelsetup(id, 1)
                                  bigtot = J(rows(id),1,.)
                                  for (i=rows(V); i; i--) {
                                      X1 = panelsubmatrix(big, i, V)
                                      bigtot[|V[i,1],.\V[i,2],.|]=J(rows(X1),1,colsum(X1))    
                                  }
                                  st_addvar("double", "bigtot3")
                                  st_store(., "bigtot3", bigtot)
                              }
                          
                               void coltot3() {
                                  ID = st_data(.,"id") //just for ensuring that ID has no gap (1,2,3...)
                                  big = st_data(.,"var1")
                                  V = panelsetup(ID, 1)
                                  Xt = panelsum(big,V) //the undocumented panelsum function
                                  bigtot = Xt[ID,]
                                  st_addvar("double", "bigtot4")
                                  st_store(., "bigtot4", bigtot)
                              }
                          end
                          
                          mata results = J(0,7,.)
                          
                          qui{
                              foreach i in 100 200 500 1000 10000  {
                                  foreach t in 100 200 500 1000 10000  {
                                      clear
                                      ** set seed in loop so drawn numbers are the same
                                      set seed 123
                                      timer clear
                                      timer on 99
                                      set obs `i'
                                      gen id = _n
                                      expand `t'
                                      by id, sort: gen t = _n
                          
                                      drawnorm var1
                          
                                      timer on 1
                                      by t (id), sort: egen bigtot1 = sum(var1)
                                      timer off 1
                          
                                      timer on 2
                                      by t  (id), sort: gen double bigtot2=sum(var1)
                                      by t  (id), sort: replace bigtot2=bigtot2[_N]
                                      timer off 2
                                      
                                      sort id t
                                      
                                      timer on 3
                                      mata: coltot2()
                                      timer off 3
                                      
                                      timer on 4
                                      mata: coltot3()
                                      timer off 4
                                      
                                      timer off 99
                                      
                                      noi disp "Results for `i', `t'"
                                      noi timer list
                                      mata results = results \ (`i',`t',`r(t1)',`r(t2)',`r(t3)',`r(t4)',`r(t99)')
                                  }
                              }
                          }
                          
                          clear
                          getmata (res*)= results
                          
                          rename res1 id
                          rename res2 t
                          rename res3 egen
                          rename res4 gen
                          rename res5 coltot2
                          rename res6 coltot3
                          
                          qui{
                              foreach type in egen gen coltot2 coltot3 {
                                  preserve
                                      keep id t `type'
                                      rename `type'* t_*
                                      noi disp "`type'"
                                      reshape wide t_ , i(id) j(t)
                                      putmata res = (id t_*), replace
                                      
                                      mata res2 = ("N\T","100","200","500","1000","10000") \ (strofreal(res[.,1]) , strofreal(res[.,2..6],"%9.3f"))
                                      
                                      noi mata res2
                                  
                                  restore
                                  
                              }
                          }

                          Comment


                          • #14
                            Cyrus Levy I think I should point out that one way to speed up to the loop without doing anything too complicated is to use range subscripts instead of V[i, 1]::V[i, 2], which creates a temporary array at each run of the loop. I will also point out that reading the data into mata takes a while (a few seconds in my system).

                            Code:
                            clear
                            set obs 1000000
                            gen id = _n
                            expand 10
                            sort id
                            gen var1 = runiform(0,1)
                            gen var2 = runiform(0,1)
                            gen var3 = runiform(0,1)
                            
                            timer clear
                            
                            timer on 1
                            mata {
                                id     = st_data(.,"id")
                                big    = (st_data(.,"var1"), st_data(.,"var2"), st_data(.,"var3"))
                                V      = panelsetup(id, 1)
                            }
                            timer off 1
                            
                            timer on 2
                            mata {
                                ncol   = 3
                                bigtot = J(rows(id), ncol, .)
                                for (i = 1; i <= rows(V); i++) {
                                    X1 = panelsubmatrix(big, i, V)
                                    bigtot[|V[i,1], 1 \ V[i, 2], ncol|] = J(rows(X1), 1, colsum(X1))
                                }
                            }
                            timer off 2
                            
                            timer on 3
                            mata {
                                ncol   = 3
                                bigtot = J(rows(id), ncol, .)
                                for (i = 1; i <= rows(V); i++) {
                                    X1 = panelsubmatrix(big, i, V)
                                    bigtot[V[i, 1]::V[i, 2], .]=J(rows(X1), 1, colsum(X1))
                                }
                            }
                            timer off 3
                            
                            timer list
                            gives
                            Code:
                            . timer list
                               1:      3.12 /        1 =       3.1190
                               2:      1.41 /        1 =       1.4130
                               3:      7.10 /        1 =       7.1020
                            So you can see it's 5.5x faster to use range subscripts, and it's not too complicated to implement.

                            PS To everyone else: If this is part of an MLE implementation in mata, reading the data into mata, setting up the panel, and back into Stata would be things that have to be done anyway. Hence timing the loop by itself is what really counts, and in that case the loop by itself actually runs faster than all the other solutions if you do 3 variables (comparing single variable solutions was not appropriate since adding variables scales roughly linearly when using gen, egen, etc. but in mata the marginal cost of adding variables is smaller).

                            Comment


                            • #15
                              Bringing this topic back because I have something important to add.

                              I realised it makes a huge difference where to put the sort t id. I updated the code above so that the data is sorted before the egen and gen command. Otherwise the first command would sort the data, while the second command would not repeat this operation because it is already sorted. The updated results (run on a different machine!) still confirm that gen is faster than egen, but by a smaller magnitude (about 50%):

                              Code:
                              egen
                                          1        2        3        4        5        6
                                  +-------------------------------------------------------+
                                1 |     N\T      100      200      500     1000    10000  |
                                2 |     100    0.006    0.006    0.015    0.027    0.147  |
                                3 |     200    0.003    0.006    0.014    0.025    0.254  |
                                4 |     500    0.010    0.013    0.030    0.059    0.701  |
                                5 |    1000    0.014    0.025    0.063    0.126    2.018  |
                                6 |   10000    0.195    0.306    0.904    1.522   17.579  |
                                  +-------------------------------------------------------+
                              gen
                                          1        2        3        4        5        6
                                  +-------------------------------------------------------+
                                1 |     N\T      100      200      500     1000    10000  |
                                2 |     100    0.001    0.003    0.007    0.013    0.087  |
                                3 |     200    0.002    0.003    0.008    0.016    0.154  |
                                4 |     500    0.003    0.008    0.018    0.038    0.366  |
                                5 |    1000    0.008    0.015    0.035    0.077    1.110  |
                                6 |   10000    0.097    0.180    0.520    0.783   10.451  |
                                  +-------------------------------------------------------+
                              coltot2
                                          1        2        3        4        5        6
                                  +-------------------------------------------------------+
                                1 |     N\T      100      200      500     1000    10000  |
                                2 |     100    0.004    0.005    0.011    0.024    0.217  |
                                3 |     200    0.004    0.008    0.019    0.038    0.390  |
                                4 |     500    0.010    0.020    0.080    0.097    1.020  |
                                5 |    1000    0.025    0.043    0.101    0.189    2.660  |
                                6 |   10000    0.266    0.534    1.311    2.095   24.717  |
                                  +-------------------------------------------------------+
                              coltot3
                                          1        2        3        4        5        6
                                  +-------------------------------------------------------+
                                1 |     N\T      100      200      500     1000    10000  |
                                2 |     100    0.002    0.004    0.012    0.023    0.206  |
                                3 |     200    0.006    0.011    0.020    0.040    0.372  |
                                4 |     500    0.011    0.019    0.048    0.103    1.075  |
                                5 |    1000    0.021    0.041    0.104    0.199    2.811  |
                                6 |   10000    0.250    0.529    1.339    2.201   29.699  |
                                  +-------------------------------------------------------+
                              The revised code with the change in bold is:

                              Code:
                              clear all
                              
                              mata:
                                  void coltot2() {
                                      id = st_data(.,"id")
                                      big = st_data(.,"var1")
                                      V = panelsetup(id, 1)
                                      bigtot = J(rows(id),1,.)
                                      for (i=rows(V); i; i--) {
                                          X1 = panelsubmatrix(big, i, V)
                                          bigtot[|V[i,1],.\V[i,2],.|]=J(rows(X1),1,colsum(X1))    
                                      }
                                      st_addvar("double", "bigtot3")
                                      st_store(., "bigtot3", bigtot)
                                  }
                              
                                   void coltot3() {
                                      ID = st_data(.,"id") //just for ensuring that ID has no gap (1,2,3...)
                                      big = st_data(.,"var1")
                                      V = panelsetup(ID, 1)
                                      Xt = panelsum(big,V) //the undocumented panelsum function
                                      bigtot = Xt[ID,]
                                      st_addvar("double", "bigtot4")
                                      st_store(., "bigtot4", bigtot)
                                  }
                              end
                              
                              mata results = J(0,7,.)
                              
                              qui{
                                  foreach i in 100 200 500 1000 10000  {
                                      foreach t in 100 200 500 1000 10000  {
                                          clear
                                          ** set seed in loop so drawn numbers are the same
                                          set seed 123
                                          timer clear
                                          timer on 99
                                          set obs `i'
                                          gen id = _n
                                          expand `t'
                                          by id, sort: gen t = _n
                              
                                          drawnorm var1
                                          
                                          sort t id
                                          
                                          timer on 1
                                          by t (id), sort: egen bigtot1 = sum(var1)
                                          timer off 1
                              
                                          timer on 2
                                          by t  (id), sort: gen double bigtot2=sum(var1)
                                          by t  (id), sort: replace bigtot2=bigtot2[_N]
                                          timer off 2
                                          
                                          sort id t
                                          
                                          timer on 3
                                          mata: coltot2()
                                          timer off 3
                                          
                                          timer on 4
                                          mata: coltot3()
                                          timer off 4
                                          
                                          timer off 99
                                          
                                          noi disp "Results for `i', `t'"
                                          noi timer list
                                          mata results = results \ (`i',`t',`r(t1)',`r(t2)',`r(t3)',`r(t4)',`r(t99)')
                                      }
                                  }
                              }
                              
                              clear
                              getmata (res*)= results
                              
                              rename res1 id
                              rename res2 t
                              rename res3 egen
                              rename res4 gen
                              rename res5 coltot2
                              rename res6 coltot3
                              
                              qui{
                                  foreach type in egen gen coltot2 coltot3 {
                                      preserve
                                          keep id t `type'
                                          rename `type'* t_*
                                          noi disp "`type'"
                                          reshape wide t_ , i(id) j(t)
                                          putmata res = (id t_*), replace
                                          
                                          mata res2 = ("N\T","100","200","500","1000","10000") \ (strofreal(res[.,1]) , strofreal(res[.,2..6],"%9.3f"))
                                          
                                          noi mata res2
                                      
                                      restore
                                      
                                  }
                              }

                              Comment

                              Working...
                              X