Dear All,
I am trying to understand why I fail to reproduce the figures reported in the Stata/MP Performance Report with my Stata installation.
Before I proceed, I want to emphasize that I do not doubt the results presented in the report, which were certainly obtained with a more rigorous approach and verified many times. Rather, I am trying to find my own mistake in replicating them.
The report is here:
https://www.stata.com/statamp/performance-report/report.pdf
(Revision 3.4.0 25sep2023)
My test machine is as follows:
OS: Windows 11
CPU: 11th Gen Intel(R) Core(TM) i5-1145G7 @ 2.60 GHz (Windows reports a current speed of 1.50 GHz)
RAM: 16.0 GB
This is the performance I estimate for the regress command:
Code:
Average execution time over 25 runs (in seconds)

  CPUs    Time    Ratio
     1    5.04     1.00
     2    2.99     1.69
     3    2.32     2.17
     4    2.48     2.03

Which means that with 4 CPUs I can cut the execution time roughly in half (5.04/2.48 = 2.03) compared to a single CPU, but according to the report mentioned above I should get a value close to 4, because "Stata's linear regression command, regress, very nearly achieves theoretical limits (see figure 4); its relative speeds increase in almost direct proportion to the number of cores" (page 8 of the report).
The results shown in the table above were obtained with the following code:
Code:
clear all

// Create test dataset
set seed 1000
sysuse nlsw88
expand 10000
generate r = runiform()*1000
count

// Run the benchmark
benchmark , cmd("regress wage grade age tenure r") repeat(25) save(RESULTS)

And benchmark is a user-written command from here.
Are the performance advantages shown in the report available only for certain sizes of data? Or are there critical overheads that must be taken into account? Is the timer command unsuitable for these measurements? Or is there perhaps a bug in my code?
I was hoping that the size of the task is the cause, based on this text from page 15: "...with problems 100 to 10,000 times smaller and run times of 0.4 seconds to just over 4 seconds on a machine running at 2.2–3.4 GHz, substantial speedups were still observed. Among commands that were at least 50% parallelized, more than half exhibited greater than 90% of the speedup exhibited on the larger problems." But the example I used contains ~22.5 million observations, which is already more than I can expect in the kind of data I commonly work with, so the task does not seem too small to parallelize well. Still, to rule size out, the benchmark could be rerun at several problem sizes, as in the sketch below.
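A minimal sketch of such a size-scaling check, reusing the same user-written benchmark command as above (the expand factors and the save() file names are arbitrary illustrations, not values from the report):
Code:
* Sketch: run the same benchmark at several problem sizes
foreach f of numlist 100 1000 10000 {
    clear all
    set seed 1000
    sysuse nlsw88
    expand `f'        // ~0.22m, ~2.2m, and ~22.5m observations
    generate r = runiform()*1000
    benchmark , cmd("regress wage grade age tenure r") ///
        repeat(25) save(RESULTS_`f')
}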
I repeated the measurements multiple times and averaged them in order to eliminate the noise coming from other applications running on the computer in parallel to Stata. What else can I possibly try?
I watched the CPU utilization during the test runs, and it does not reach 100% across all cores, which gives me some reassurance that it is not other processes that affect the benchmarking results. I also do not see memory usage approaching the total available memory, which gives me some reassurance that it is not memory congestion that spoils the results.
Page 7 of the report mentions that "To reduce the impact of interruptions by the operating system, the timings were repeated three times and the shortest time was recorded." Is taking the minimum preferable to averaging over multiple runs? (Taking the minimum of the run times still does not reproduce the report's results.) A sketch of how I could record both statistics with timer is below.
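For completeness, a minimal sketch of collecting both the minimum and the mean with timer and set processors instead of the benchmark command, assuming a Stata/MP license (set processors is an MP feature; the run count of 25 matches my setup above, everything else is illustrative):
Code:
* Sketch: time regress at 1-4 processors, recording min and mean
clear all
set seed 1000
sysuse nlsw88
expand 10000
generate r = runiform()*1000

forvalues p = 1/4 {
    set processors `p'              // requires Stata/MP
    local min = .                   // missing compares above any number
    local sum = 0
    forvalues i = 1/25 {
        timer clear 1
        timer on 1
        quietly regress wage grade age tenure r
        timer off 1
        quietly timer list 1
        if r(t1) < `min' local min = r(t1)
        local sum = `sum' + r(t1)
    }
    local mean = `sum'/25
    display "CPUs=`p'  min=" %6.3f `min' "  mean=" %6.3f `mean'
}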
PS: My Stata license is limited to 4 CPUs, so I cannot test parallelization at higher CPU counts. If anyone has access to a license with more CPUs, could you please re-run the same code and report what the results look like with more processors?
PPS: I need to estimate the benefits from parallelizing another command (not from Stata-supplied code), but before I do that, I want to make sure that the method I am using is consistent with the results reported in the Stata/MP report mentioned above.
PPPS: I am also curious about the superlinear parallelization results presented in the report for some commands (e.g., by: replace), whose performance grows by more than the number of CPU cores: for by: replace, 16 CPUs do the work 22.9 times faster than 1 CPU (page 19). How can that be? A sketch of how I would try to time this case myself is below.
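In case anyone wants to check the superlinear case on a bigger license, a sketch of how I would time a by: replace myself; the data shape and the particular replace expression are my own guesses, not the report's actual test case, and set processors again requires Stata/MP:
Code:
* Sketch: timing by: replace at 1-4 processors
clear all
set seed 1000
set obs 20000000
generate long group = ceil(runiform()*1000)
generate x = rnormal()
sort group

forvalues p = 1/4 {
    set processors `p'
    generate xc = x                   // fresh copy each round
    timer clear `p'
    timer on `p'
    by group: replace xc = x - x[1]   // subtract the group's first value
    timer off `p'
    drop xc
}
timer list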
Thank you, Sergiy Radyakin