
  • Stata MP performance

    Dear All,

    I am trying to understand why I fail to reproduce the figures reported in the StataMP Performance Report with my Stata installation.

    Importantly, before I proceed, I wanted to underline that I do not doubt the results presented in the report, which were certainly obtained using a more rigorous approach and verified numerous times. Instead, I am trying to find my own mistake in replicating them.

    The report is here:
    https://www.stata.com/statamp/performance-report/report.pdf
    (Revision 3.4.0 25sep2023)

    My test machine is as follows:
    OS: Windows 11
    CPU: 11th Gen Intel(R) Core(TM) i5-1145G7 @ 2.60GHz 1.50 GHz
    RAM: 16.0 GB



    This is the performance I estimate for the regress command:

    Code:
    Average execution time over 25 runs (in seconds)
        CPUs   Time   Ratio  
           1   5.04    1.00  
           2   2.99    1.69  
           3   2.32    2.17  
           4   2.48    2.03
    This means that with 4 CPUs the execution time is cut roughly in half (5.04/2.48 ≈ 2.03) compared with a single CPU, whereas according to the report mentioned above I should get a ratio close to 4. The report states on page 8: "Stata’s linear regression command, regress, very nearly achieves theoretical limits (see figure 4); its relative speeds increase in almost direct proportion to the number of cores."
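
    As a rough way to read the table above, one can back out the fraction of the run that would have to be parallelized to produce the observed ratio. This assumes Amdahl's law applies, which is an assumption on my part; the passage quoted from the report does not name it. A minimal sketch using only the two timings from the table (the local macro names are illustrative):

    ```stata
    * Amdahl's law (assumed): S(n) = 1 / ((1 - p) + p/n)
    * Solving for p:          p = (1 - 1/S) / (1 - 1/n)
    * Timings copied from the table above (seconds, averaged over 25 runs)
    local t1 = 5.04
    local t4 = 2.48
    local n  = 4

    local S = `t1' / `t4'                  // observed speedup
    local e = `S' / `n'                    // parallel efficiency
    local p = (1 - 1/`S') / (1 - 1/`n')    // implied parallelized fraction

    display "speedup    = " %5.2f `S'
    display "efficiency = " %5.2f `e'
    display "implied p  = " %5.2f `p'
    ```

    For these timings the sketch displays a speedup of 2.03, an efficiency of 0.51, and an implied parallelized fraction of about 0.68, well short of the near-complete parallelization the report describes for regress.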

    The results shown in the table above are obtained with the following code

    Code:
    clear all
    
    // Create test dataset
    set seed 1000
    sysuse nlsw88
    expand 10000
    generate r=runiform()*1000
    count
    
    // Run the benchmark
    benchmark , cmd("regress wage grade age tenure r") repeat(25) save(RESULTS)
    And benchmark is a user-written command from here.

    Are the performance advantages shown in the report available only for data of a certain size? Or are there critical overheads that must be considered? Is the timer command unsuitable for these measurements? Or is there perhaps a bug in my code?

    I was hoping that the size of the task would be the cause, based on the text (page 15): "...with problems 100 to 10,000 times smaller and run times of 0.4 seconds to just over 4 seconds on a machine running at 2.2–3.4 GHz, substantial speedups were still observed. Among commands that were at least 50% parallelized, more than half exhibited greater than 90% of the speedup exhibited on the larger problems." But the example I used contains ~22.5 million observations, which is already more than I can expect in the kind of data I commonly work with.

    I tried to replicate the results multiple times and average them, so that I can eliminate the noise coming from other applications running on the computer in parallel to Stata. What else can I possibly try?

    I observed the CPU utilization during the test runs, and it does not reach 100% (over all cores), which gives me some reassurance that other processes are not affecting the benchmarking results. Memory usage also stays below the available memory, which gives me some reassurance that memory congestion is not spoiling the results.

    Page 7 of the report mentions that "To reduce the impact of interruptions by the operating system, the timings were repeated three times and the shortest time was recorded." Is that preferable to averaging over multiple runs? (Taking the minimum run times still does not reproduce the report's results.)
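
    To compare the two summaries directly, the mean and the shortest time can be recorded from the same set of runs. A minimal sketch using Stata's built-in timer command rather than the user-written benchmark command; it assumes the test dataset created above is still in memory, and the local macro names are illustrative:

    ```stata
    * Assumes the test dataset from the code above is already in memory
    local reps = 3                 // the report repeats each timing three times
    local tmin = .                 // running minimum (missing until first run)
    local tsum = 0                 // running total, for the mean

    forvalues i = 1/`reps' {
        timer clear 1
        timer on 1
        quietly regress wage grade age tenure r
        timer off 1
        quietly timer list 1       // stores the elapsed seconds in r(t1)
        local t = r(t1)
        local tsum = `tsum' + `t'
        local tmin = min(`tmin', `t')   // min() ignores the initial missing value
    }

    display "mean time (s): " %6.3f (`tsum'/`reps')
    display "min  time (s): " %6.3f `tmin'
    ```

    The timer is cleared before every run so that each repetition is measured separately instead of being accumulated into one total.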

    PS: My Stata license is restricted to 4 CPUs, so I could not test parallelization with more CPUs. If anyone has access to a license for more CPUs, could you please re-run the same code and report what the results look like for more processors?
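
    For anyone re-running the test, it may help to first confirm how many processors Stata/MP sees and how many the license permits. A small sketch using Stata's c() system values and set processors; it assumes Stata/MP and the test dataset in memory, and the choice of 2 processors is only an example:

    ```stata
    * Processor settings that Stata/MP reports for this session
    display "processors in use:     " c(processors)
    display "maximum usable:        " c(processors_max)
    display "allowed by license:    " c(processors_lic)
    display "available on machine:  " c(processors_mach)

    * Restrict Stata/MP to 2 processors for one manual run, then restore
    local old = c(processors)
    set processors 2
    quietly regress wage grade age tenure r
    set processors `old'
    ```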

    PPS: I need to estimate benefits from parallelization of another command (not from Stata-supplied code), but before I do that, I wanted to make sure that the method I am using is consistent with the results reported in the Stata MP report mentioned above.

    PPPS: I am also curious about the super-efficient parallelization results for some commands presented in the report (e.g. by: replace), whose performance grows faster than the number of CPU cores. For by: replace, 16 CPUs do the work 22.9 times faster than 1 CPU (page 19). How come?

    Thank you, Sergiy Radyakin
    [Attached image: ratio.png]

  • #2
    I ran your code and these are my results:

    Code:
        CPUs   Time   Ratio  
           1   7.46    1.00  
           2   3.89    1.92  
           3   2.86    2.61  
           4   2.71    2.75
    For estimation commands, I would expect diminishing returns as more CPUs are used, so I think the dip you see with 4 CPUs is a noise artefact. Beyond that, I could only (unhelpfully) speculate as to the cause.



    • #3
      My benchmark seems more in line with the Stata report:

      Code:
      . // Run the benchmark
      . benchmark , cmd("regress wage grade age tenure r") repeat(25) save(RESULTS)
      CPUs is set to: 1.........................
      CPUs is set to: 2.........................
      CPUs is set to: 3.........................
      CPUs is set to: 4.........................
       
      Average execution time over 25 runs (in seconds)
      
      
          CPUs   Time   Ratio  
             1   6.79    1.00  
             2   3.35    2.03  
             3   2.40    2.83  
             4   1.95    3.49



      • #4
        This is a good question. In my previous work, I also observed that runtimes didn’t scale linearly with the number of CPUs, and I was curious about the reasons behind this. Here are my benchmark results.

        Code:
        -------------------------------------------------------------------------------
        Date and time:  9 Nov 2024 13:15:32
        Stata version: 18
        Updated as of: 16 Oct 2024
        Variant:       MP
        Processors:    8
        OS:            Windows 64-bit
        Machine type:  PC (64-bit x86-64)
        -------------------------------------------------------------------------------
        
        . // Run the benchmark
        . benchmark , cmd("regress wage grade age tenure r") repeat(25) save(RESULTS)
        CPUs is set to: 1.........................
        CPUs is set to: 2.........................
        CPUs is set to: 3.........................
        CPUs is set to: 4.........................
        CPUs is set to: 5.........................
        CPUs is set to: 6.........................
        CPUs is set to: 7.........................
        CPUs is set to: 8.........................
         
        Average execution time over 25 runs (in seconds)
        
        
            CPUs   Time   Ratio  
               1   5.62    1.00  
               2   2.74    2.05  
               3   2.14    2.63  
               4   1.81    3.10  
               5   1.58    3.56  
               6   1.50    3.75  
               7   1.46    3.85  
               8   1.41    3.98
        Associate Professor of Finance and Economics
        University of Illinois
        www.julianreif.com
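
        The diminishing returns in the 8-processor table above can be tabulated per row. A rough sketch that enters the posted timings as a small dataset and computes the speedup, the parallel efficiency, and the parallelized fraction implied under Amdahl's law (an assumption, not something stated in the thread); the variable names are illustrative, and clear discards whatever data are in memory:

        ```stata
        clear
        input cpus time
        1 5.62
        2 2.74
        3 2.14
        4 1.81
        5 1.58
        6 1.50
        7 1.46
        8 1.41
        end

        * Speedup relative to the single-processor run (first observation)
        generate speedup    = time[1] / time
        * Share of the ideal linear speedup actually achieved
        generate efficiency = speedup / cpus
        * Amdahl's law solved for p; undefined for a single processor
        generate implied_p  = (1 - 1/speedup) / (1 - 1/cpus) if cpus > 1

        format speedup efficiency implied_p %5.2f
        list, clean noobs
        ```

        With 2 processors the implied fraction comes out slightly above 1, which the model cannot produce, so that row more likely reflects timing noise than a real effect.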

