I just ran an example from the Stata Journal paper about boottest ("Fast and Wild") and was surprised at its poor performance. I'm fortunate enough to be working in Stata/MP, and I discovered that the more processors I enabled it to use, the slower it got. I'm using a Dell XPS 17 9700, which has a pretty good cooling system. Its CPU is an Intel i7-10875H, which has 8 cores and hyperthreading. The machine has 64GB of RAM and runs Windows 10 Pro. I'm running the 12-core edition of Stata/MP 16.1.
Here is the log from a distilled demonstration. It sets the number of cores to 1, 2, ..., 12. On each iteration it calls a program that creates a 2500 x 1 matrix X and then computes X + X :* X 10,000 times. Simplifying that calculation to X + X or X :* X makes the problem go away.
I'm wondering if anyone else with access to Stata/MP gets similar results, or has insights. Possibly it doesn't happen on all computers. I understand that implementing invisible parallelization in a compiler is a tricky business. But Stata/MP doesn't come cheap!
That's right: using 1 core takes 0.14 seconds. Using 8 cores takes 3.84 seconds. Using 12 (with hyperthreading) takes 5.63 seconds.
I monitored CPU usage during these tests and saw no evidence of throttling.
I'm worried that my Mata-based programs are getting seriously slowed down.
If you've got MP and can run this test, I'd be interested in the results.
Code:
cap mata mata drop demo()
mata
mata set matastrict on
mata set matalnum off
mata set mataoptimize on
void demo() {
  real matrix X; real scalar i
  X = runiform(2500,1)
  for (i=10000; i; i--)
    (void) X + X :* X
}
end

timer clear
forvalues p=1/12 {
  qui set processors `p'
  set seed 1202938431
  timer on `p'
  mata demo()
  timer off `p'
}
timer list
Code:
. timer list
   1:      0.14 /        1 =       0.1390
   2:      0.16 /        1 =       0.1640
   3:      1.63 /        1 =       1.6330
   4:      2.02 /        1 =       2.0150
   5:      2.47 /        1 =       2.4680
   6:      2.92 /        1 =       2.9210
   7:      3.38 /        1 =       3.3780
   8:      3.84 /        1 =       3.8370
   9:      4.26 /        1 =       4.2640
  10:      4.70 /        1 =       4.7040
  11:      5.21 /        1 =       5.2100
  12:      5.63 /        1 =       5.6260
Here's output I get from Stata 15.0--it's actually better! But still bad:
Code:
   1:      0.13 /        1 =       0.1280
   2:      0.14 /        1 =       0.1440
   3:      1.09 /        1 =       1.0900
   4:      1.35 /        1 =       1.3460
   5:      1.55 /        1 =       1.5540
   6:      1.78 /        1 =       1.7810
   7:      2.10 /        1 =       2.1010
   8:      2.35 /        1 =       2.3540
   9:      2.59 /        1 =       2.5920
  10:      2.89 /        1 =       2.8890
  11:      3.16 /        1 =       3.1570
  12:      3.41 /        1 =       3.4120
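For anyone who wants to check the remark that simplifying the expression to X + X or X :* X makes the slowdown go away, here is a rough sketch of how the three expressions could be timed side by side. It is only an illustration, not the test that produced the logs in this post: the function name demo2(), its which argument, and the timer numbering are made up for the example, and it assumes Stata/MP, where c(processors_max) gives the largest value that set processors will accept.

```stata
cap mata mata drop demo2()
mata
// Hypothetical variant of demo():
//   which = 1 -> X + X
//   which = 2 -> X :* X
//   otherwise -> X + X :* X (the expression from the original demo)
void demo2(real scalar which) {
    real matrix X; real scalar i
    X = runiform(2500,1)
    if (which==1) {
        for (i=10000; i; i--) (void) X + X
    }
    else if (which==2) {
        for (i=10000; i; i--) (void) X :* X
    }
    else {
        for (i=10000; i; i--) (void) X + X :* X
    }
}
end

timer clear
local t = 0
// Run each expression with 1 processor, then with the maximum available
foreach p in 1 `c(processors_max)' {
    qui set processors `p'
    forvalues w = 1/3 {
        local ++t
        timer on `t'
        mata: demo2(`w')
        timer off `t'
    }
}
timer list
```

With this numbering, timers 1-3 are the three expressions on one processor and timers 4-6 are the same three on the maximum number of processors, so comparing timer 3 with timer 6 should isolate the combined expression.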