  • Terrible parallelization performance in Mata

    I just ran an example from the SJ paper about boottest ("Fast and Wild") and was surprised at its poor performance. I'm fortunate enough to be working in Stata/MP, and I discovered that the more processors I enabled it to use, the slower it got. I'm using a Dell XPS 17 9700, which has a pretty good cooling system. Its CPU is an Intel i7-10875H, which has 8 cores and hyperthreading. I'm running Stata/MP 12-core 16.1. It's got 64GB of RAM and Windows 10 Pro.

    Here is the log from a distilled demonstration. It sets the number of cores to 1, 2, ..., 12. On each iteration it calls a program that creates a 2500 x 1 matrix X and then computes X + X :* X 10,000 times. Simplifying that calculation to X + X or X :* X makes the problem go away.

    I'm wondering if anyone else with access to Stata/MP gets similar results, or has insights. Possibly it doesn't happen on all computers. I understand that implementing invisible parallelization in a compiler is a tricky business. But Stata/MP doesn't come cheap!

    Code:
    cap mata mata drop demo()
    
    mata
    mata set matastrict on
    mata set matalnum off
    mata set mataoptimize on
    
    void demo() {
        real matrix X; real scalar i
        X = runiform(2500,1)
        for (i=10000; i; i--)
            (void) X + X :* X
    }
    end
    
    timer clear
    forvalues p=1/12 {
      qui set processors `p'
      set seed 1202938431
      timer on `p'
      mata demo()
      timer off `p'
    }
    timer list
    Output:
    Code:
    . timer list
       1:      0.14 /        1 =       0.1390
       2:      0.16 /        1 =       0.1640
       3:      1.63 /        1 =       1.6330
       4:      2.02 /        1 =       2.0150
       5:      2.47 /        1 =       2.4680
       6:      2.92 /        1 =       2.9210
       7:      3.38 /        1 =       3.3780
       8:      3.84 /        1 =       3.8370
       9:      4.26 /        1 =       4.2640
      10:      4.70 /        1 =       4.7040
      11:      5.21 /        1 =       5.2100
      12:      5.63 /        1 =       5.6260
    That's right: using 1 core takes 0.14 seconds. Using 8 cores takes 3.84 seconds. Using 12 (with hyperthreading) takes 5.63 seconds.

    Here's output I get from Stata 15.0--it's actually better! But still bad:
    Code:
       1:      0.13 /        1 =       0.1280
       2:      0.14 /        1 =       0.1440
       3:      1.09 /        1 =       1.0900
       4:      1.35 /        1 =       1.3460
       5:      1.55 /        1 =       1.5540
       6:      1.78 /        1 =       1.7810
       7:      2.10 /        1 =       2.1010
       8:      2.35 /        1 =       2.3540
       9:      2.59 /        1 =       2.5920
      10:      2.89 /        1 =       2.8890
      11:      3.16 /        1 =       3.1570
      12:      3.41 /        1 =       3.4120
    I monitored CPU usage during these tests and saw no evidence of throttling.

    I'm worried that my Mata-based programs are getting seriously slowed down.

    If you've got MP and can run this test, I'd be interested in the results.
    Last edited by David Roodman; 30 Nov 2020, 20:19.
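
    A rough reading of the Stata 16.1 timings above, added as an illustrative calculation rather than anything from the original post. Spread over the 10,000 loop iterations, the extra time at 12 processors comes to about half a millisecond per iteration, and between 3 and 12 processors each added processor costs roughly 0.44 seconds in total. The figures are copied from the log; only the arithmetic is new.

    ```stata
    * extra seconds per loop iteration at 12 processors vs. 1 processor
    display (5.6260 - 0.1390) / 10000
    * average extra seconds per added processor between p = 3 and p = 12
    display (5.6260 - 1.6330) / (12 - 3)
    ```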

  • #2
    Here are timings from Stata 16.1/MP-4 on some Windows server I have access to (don't ask me what kind exactly; I could find out, though, if necessary). I observe a similar pattern.

    Code:
    cap mata mata drop demo()
    
    mata
    mata set matastrict on
    mata set matalnum off
    mata set mataoptimize on
    
    void demo() {
        real matrix X; real scalar i
        X = runiform(2500,1)
        for (i=10000; i; i--)
            (void) X + X :* X
    }
    end
    
    timer clear
    forvalues p=1/4 {
      qui set processors `p'
      set seed 1202938431
      timer on `p'
      mata demo()
      timer off `p'
    }
    timer list
    Output:

    Code:
    . timer list
       1:      0.15 /        1 =       0.1480
       2:      0.20 /        1 =       0.1980
       3:      1.32 /        1 =       1.3230
       4:      1.71 /        1 =       1.7140
    I also observe that the problem goes away when the computation is simplified to X+X or X:*X. However, doing the computation in two steps does not seem to help:

    Code:
    cap mata mata drop demo()
    
    mata
    mata set matastrict on
    mata set matalnum off
    mata set mataoptimize on
    
    void demo() {
        real matrix X, Y; real scalar i
        X = runiform(2500,1)
        for (i=10000; i; i--) {
            Y = X :* X
            Y = X + Y
        }
    }
    end
    
    timer clear
    forvalues p=1/4 {
      qui set processors `p'
      set seed 1202938431
      timer on `p'
      mata demo()
      timer off `p'
    }
    timer list
    Output:

    Code:
    . timer list
       1:      0.15 /        1 =       0.1550
       2:      0.20 /        1 =       0.1970
       3:      1.27 /        1 =       1.2720
       4:      1.73 /        1 =       1.7280
    ben


    • #3
      I am able to reproduce on my Windows machine.

      Code:
      . timer list
         1:      0.16 /        1 =       0.1630
         2:      0.46 /        1 =       0.4560
         3:      3.22 /        1 =       3.2200
         4:      4.14 /        1 =       4.1410
      And as Ben Jann observed, changing the code to
      Code:
           for (i=10000; i; i--) {        
                Y = X :* X        
                Y = X + Y    
           }
      does not help. But if you separate them into two loops, the problem goes away:

      Code:
      mata:
      void demo1() {
          real matrix X; real scalar i
          X = runiform(2500,1)
          for (i=10000; i; i--) {
              (void) X :* X
          }
      
          for (i=10000; i; i--) {
              (void) X + X
          }    
      }
      end
      
      . timer list
         1:      0.15 /        1 =       0.1460
         2:      0.47 /        1 =       0.4710
         3:      0.48 /        1 =       0.4780
         4:      0.47 /        1 =       0.4650
      Note that the problem size is tiny. Hence the overhead can overpower the benefit of parallelization, and adding more cores can make the performance worse. But something is definitely going on, given that X + X and X :* X on their own do not appear to have the issue. Anyway, we will investigate and report back.
      Last edited by Hua Peng (StataCorp); 01 Dec 2020, 09:35.


      • #4
        Thank you, Hua. I agree the individual calculations are small. Still, I think the use case is realistic: one may repeat the same calculation many times on small data sets for Monte Carlo or bootstrapping purposes. So I'm glad you're investigating. This example is derived from the wild2() program in "Fast and Wild," which I think is pretty realistic and might provide a good test bed. I think there are at least 3 lines in that little program exhibiting the same behavior.


        • #5
          This is not the first time this issue has arisen on Statalist. I was able to locate a previous topic I participated in (which did not draw the attention of anyone from StataCorp) at

          https://www.statalist.org/forums/for...-stata-mp-15-1

          and this topic also involved Mata's performance on multiprocessor systems.

          In post #16 at the top of the second page of this earlier topic I summarized my conclusions as "multiprocessing is slow in long loops of small calculations" due to the overhead of setting up the multiple processes. Certainly the Stata 16 experience reported in post #1 of today's topic suggests each additional processor utilized beyond the second requires 0.4 seconds of setup time; all this for a calculation that takes but 0.13 seconds on a single processor. My conclusion in the previous topic was that it "suggests that when using Stata/MP an early step in evaluating poor performance should be to run it on a single processor and see if your code is perhaps spending too much time preparing small tasks for multiprocessing."

          From today's topic I learn that, based on the experience with the simplifications, Stata seems to apply heuristics to try to determine if the gain in performance is likely to be worth the pain of multiprocessing. It gets the right answer for the simplified expressions, and the wrong answer for the slightly-more-complex expressions.

          I'm looking forward to what light StataCorp can shed on this. Certainly, the lesson is that Stata/MP does not, and realistically cannot, guarantee that run times on multiple processors are bounded above by single-processor performance. What we can hope is that StataCorp is able to incrementally improve the heuristics that make the choice to utilize additional processors.

          Added in edit: Another earlier topic can be found at

          https://www.statalist.org/forums/for...timing-mystery

          which I mention only because it seems to support a recollection of mine, which I have not been able to track down on Statalist, of a performance issue of some sort that affected only Stata for Windows, because the task required OS support and Linux and macOS handle that task more efficiently than Windows does.
          Last edited by William Lisowski; 01 Dec 2020, 12:20.
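
          The diagnostic suggested in this post, running the same code on a single processor before blaming the code itself, might be sketched as follows. This is a minimal illustration, not taken from the thread: it assumes Stata/MP is in use and that the Mata function demo() from post #1 has already been defined. c(processors) holds the number of processors Stata is currently set to use.

          ```stata
          * remember the current setting so it can be restored afterwards
          local p0 = c(processors)
          timer clear

          * timer 1: everything forced onto a single processor
          quietly set processors 1
          timer on 1
          mata: demo()
          timer off 1

          * timer 2: the original processor setting
          quietly set processors `p0'
          timer on 2
          mata: demo()
          timer off 2

          timer list
          ```

          If timer 1 is clearly smaller than timer 2, the code is probably spending its time on multiprocessing overhead rather than on the calculation.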


          • #6
            Ok, I think I found the problem. For Mata's colon operator (help m2_op_colon) in Stata/MP, we have a bug in the setup routine for the number of threads to use when the number of cores/processors available is larger than or equal to 4. In the case David found, the matrix is 2500x1, so Stata/MP should use only 2 threads even if the number of cores/processors available is larger than 2. But due to the bug, Stata/MP launches all 12 threads (the number allowed by David's machine and license). Only the first two are used for the calculation; the rest are launched and then go away without doing any work. Hence the large overhead.

            Note: the bug negatively affects only the performance of the Mata colon operator for small problems on Stata/MP with more than 2 cores/processors. The numeric results are not affected, i.e., the results are correct.

            We will get this fixed in a future Stata update.
            Last edited by Hua Peng (StataCorp); 01 Dec 2020, 13:35.
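
            For readers on a build that still has this bug, one possible stopgap is to lower the processor count around Mata code that loops over small matrices, then restore it. This is an assumption based on the timings earlier in the thread, where 1 or 2 processors avoided most of the slowdown; it is not a StataCorp recommendation. demo() stands in for whatever Mata routine is affected.

            ```stata
            * hypothetical workaround: cap the processor count for the Mata call only
            local p0 = c(processors)
            * post #3 shows 2 processors can still be slower than 1 on some machines
            quietly set processors 1
            mata: demo()
            * restore the original setting for the rest of the session
            quietly set processors `p0'
            ```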


            • #7
              Hua Peng (StataCorp) In my case, would it always be launching 12 threads, or p threads, where p is set by the loop in the demo? If it always launched 12, then I wouldn't expect the time cost to rise steadily with p as it does on my computer.


              • #8
                Thanks, William Lisowski for the links to the old posts, including one that was mine that I completely forgot! I found it a very interesting read...
                Last edited by David Roodman; 01 Dec 2020, 14:36.


                • #9
                  David Roodman, no, it will always launch as many threads as there are processors available. After -set processors 4-, the number of processors available becomes 4 instead of 12.


                  • #10
                    The issue is fixed in today's update. Type:

                    Code:
                    update all
                    to apply the update.


                    • #11
                      Excellent, thanks!
