
  • #16
    I have looked at the code from post #1 in somewhat greater depth than I had before.

    It seems to me that you are taking the mean of a column vector of length between 101 and 201 a total of 500,000 times.
    Code:
    : y1 = J(n, 1, .)
    
    : mn = n
    
    : mx = 0
    
    : for (i = 1; i <= n; i++) {
    >     y1[i] = mean(x[|ixl[i] \ ixu[i]|])
    >         z = x[|ixl[i] \ ixu[i]|]
    >         mn = mn<rows(z) ? mn : rows(z)
    >         mx = mx>rows(z) ? mx : rows(z)
    > }
    
    : 
    : mn
      101
    
    : mx
      201
    I am now not so surprised that Stata/MP performs poorly on your problem. There's an overhead in setting up a problem for multiple processors, and I think the speed gain from using multiple processors to take the mean of 201 or fewer numbers is outweighed by the time needed to set up quadcross() (which mean() appears to rely on) for parallel processing.

    I expect that for sum() and quadsum() there is either (a) less setup time or (b) no setup time as they do not in fact take advantage of multiple processors. Explanation (b) is suggested by the results in post #9, which show similar times for 1 and 4 processors for the examples using sum() and quadsum().

    So, I think what we're seeing could be titled "multiprocessing is slow in long loops of small calculations".

    It suggests that, when using Stata/MP, an early step in diagnosing poor performance should be to rerun the code on a single processor and see whether it is spending too much time preparing small tasks for multiprocessing. In theory, the following code should demonstrate this sort of comparison for your task, but regrettably I cannot test it myself.
    Code:
    timer clear
    set processors 1
    timer on 1
    mata
    for (i = 1; i <= n; i++) {
        y1[i] = mean(x[|ixl[i] \ ixu[i]|])
    }
    end
    timer off 1
    set processors `=c(processors_lic)'
    timer on 2
    mata
    for (i = 1; i <= n; i++) {
        y1[i] = mean(x[|ixl[i] \ ixu[i]|])
    }
    end
    timer off 2
    timer list
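
    The same comparison can be extended to every processor count the installation allows. What follows is only a sketch under stated assumptions: it requires Stata/MP, it rebuilds the data itself (sizes copied from post #17, with rowmax() and rowmin() used here to clamp the window bounds), and it uses c(processors_max) rather than c(processors_lic) so that the loop never asks for more cores than the machine has. Timer p holds the time for p processors.

    ```stata
    * Sketch only: assumes Stata/MP; data sizes copied from post #17
    mata:
    n   = 500000
    x   = runiform(n, 1)
    ixl = rowmax(((1::n) :- 100, J(n, 1, 1)))   // window start, clamped at 1
    ixu = rowmin(((1::n) :+ 100, J(n, 1, n)))   // window end, clamped at n
    y1  = J(n, 1, .)
    end

    timer clear
    local pmax = c(processors_max)
    forvalues p = 1/`pmax' {
        set processors `p'
        timer on `p'
        * one-line Mata call so that it can sit inside a Stata loop
        mata: for (i = 1; i <= n; i++) y1[i] = mean(x[|ixl[i] \ ixu[i]|])
        timer off `p'
    }
    timer list
    ```

    If the time jumps between 1 and 2 processors and then stays roughly flat, that would point to a fixed setup cost per call rather than a cost that grows with the number of cores.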



    • #17
      Here's what I get (Stata 16.1/Win 64)
      Code:
      . cscript
      -------------------------------------------------------------------------BEGIN
      
      . timer clear
      
      .
      . mata
      ------------------------------------------------- mata (type end to exit) -----
      : real scalar function my_mean (real vector x)
      > {
      >     return(sum(x) / (length(x) - missing(x)))
      > }
      
      :
      : real scalar function my_quadmean (real vector x)
      > {
      >     return(quadsum(x) / (length(x) - missing(x)))
      > }
      
      :
      : n = 500000
      
      : x = runiform(n, 1)
      
      : x[selectindex(x :< 0.1)] = J(sum(x :< 0.1), 1, missingof(x))
      
      : ixl = (1::n) :- 100
      
      : ixu = (1::n) :+ 100
      
      : ixl[selectindex(ixl :< 1)] = J(sum(ixl :< 1), 1, 1)
      
      : ixu[selectindex(ixu :> n)] = J(sum(ixu :> n), 1, n)
      
      : y1 = J(n, 1, .)
      
      : y2 = J(n, 1, .)
      
      : y3 = J(n, 1, .)
      
      : end
      -------------------------------------------------------------------------------
      
      .
      .
      . set processors 1
          The maximum number of processors or cores being used is changed from 4 to
          1.  It can be set to any number between 1 and 4
      
      . timer on 1
      
      . mata
      ------------------------------------------------- mata (type end to exit) -----
      : for (i = 1; i <= n; i++) {
      >     y1[i] = mean(x[|ixl[i] \ ixu[i]|])
      > }
      
      : end
      -------------------------------------------------------------------------------
      
      . timer off 1
      
      . set processors `=c(processors_lic)'
          The maximum number of processors or cores being used is changed from 1 to
          4.  It can be set to any number between 1 and 4
      
      . timer on 2
      
      . mata
      ------------------------------------------------- mata (type end to exit) -----
      : for (i = 1; i <= n; i++) {
      >     y1[i] = mean(x[|ixl[i] \ ixu[i]|])
      > }
      
      : end
      -------------------------------------------------------------------------------
      
      . timer off 2
      
      . timer list
         1:      4.71 /        1 =       4.7140
         2:     19.86 /        1 =      19.8640



      • #18
        William Lisowski I agree about the title change; further, it is in fact only a problem on Windows (and with cross/quadcross rather than with mean per se). However, I am not sure how to change the title.
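
        One way to check the cross/quadcross point is to time the same windowed mean computed two ways, once through quadcross() and once through quadsum(). This is a rough sketch, not the code behind the timings in this thread: the data setup is copied in spirit from post #17 but without missing values (so both routes divide by the same count), and it uses Mata's own timer functions, which are an addition here.

        ```stata
        mata:
        n   = 500000
        x   = runiform(n, 1)                        // no missing values in this sketch
        ixl = rowmax(((1::n) :- 100, J(n, 1, 1)))   // window start, clamped at 1
        ixu = rowmin(((1::n) :+ 100, J(n, 1, n)))   // window end, clamped at n
        y1  = J(n, 1, .)
        y2  = J(n, 1, .)

        timer_clear()

        timer_on(1)                                 // route 1: sum via quadcross()
        for (i = 1; i <= n; i++) {
            z     = x[|ixl[i] \ ixu[i]|]
            y1[i] = quadcross(J(rows(z), 1, 1), z) / rows(z)
        }
        timer_off(1)

        timer_on(2)                                 // route 2: sum via quadsum()
        for (i = 1; i <= n; i++) {
            z     = x[|ixl[i] \ ixu[i]|]
            y2[i] = quadsum(z) / rows(z)
        }
        timer_off(2)

        timer()                                     // report both timers
        mreldif(y1, y2)                             // the two results should agree closely
        end
        ```

        If timer 1 is much slower than timer 2 under Stata/MP on Windows but not after set processors 1, that would be consistent with the overhead sitting in quadcross().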

        The number of processors does not seem to make much of a difference, by the way, so there has to be some fixed cost in setting up MP. Here is the result of your snippet for me, with this added:
        Code:
        set processors 2
        timer on 3
        mata
        for (i = 1; i <= n; i++) {
            y1[i] = mean(x[|ixl[i] \ ixu[i]|])
        }
        end
        timer off 3
        Code:
        . timer list
           1:      9.34 /        1 =       9.3380
           2:    107.03 /        1 =     107.0260
           3:     98.03 /        1 =      98.0330
        PS: I initially discovered this issue when using rangestat; the example is just a MWE for illustration. The live problem was that taking the mean and standard deviation with rangestat took an hour; when I coded the mean and sd myself (in Mata), rangestat took only half a minute.
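
        For illustration, a sum-based standard deviation in the spirit of my_quadmean() from post #17 might look like the sketch below. The name my_sd() is hypothetical and this is not necessarily the function used in the live problem; it also leaves out whatever wrapper rangestat needs to call a user-written Mata function. It relies on quadsum() treating missing values as zero, as the functions in post #17 do.

        ```stata
        mata:
        // Hypothetical helper: sample sd built from quadsum() only
        real scalar function my_sd(real vector x)
        {
            real scalar k, m

            k = length(x) - missing(x)      // number of nonmissing values
            if (k < 2) return(.)            // sd undefined for fewer than 2 values
            m = quadsum(x) / k              // mean of the nonmissing values
            // deviations of missing values stay missing and add 0 to the sum
            return(sqrt(quadsum((x :- m):^2) / (k - 1)))
        }

        my_sd((1 \ 2 \ . \ 3 \ 4))
        end
        ```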
        Last edited by Mauricio Caceres; 01 Jun 2020, 16:37.

