I have looked at the code from post #1 in somewhat greater depth than I had before.
It seems to me that you are taking the mean of a column vector of length between 101 and 201 a total of 500,000 times.
I am now less surprised that Stata/MP performs poorly on your problem. Setting up a calculation for multiple processors carries an overhead, and I think the speed gain from using multiple processors to take the mean of 201 or fewer numbers is outweighed by the time needed to set up quadcross() (which, as I understand it, mean() relies on) for parallel processing.
Code:
: y1 = J(n, 1, .)
: mn = n
: mx = 0
: for (i = 1; i <= n; i++) {
>     y1[i] = mean(x[|ixl[i] \ ixu[i]|])
>     z = x[|ixl[i] \ ixu[i]|]
>     mn = mn<rows(z) ? mn : rows(z)
>     mx = mx>rows(z) ? mx : rows(z)
> }

: mn
  101

: mx
  201
I expect that for sum() and quadsum() there is either (a) less setup time or (b) no setup time as they do not in fact take advantage of multiple processors. Explanation (b) is suggested by the results in post #9, which show similar times for 1 and 4 processors for the examples using sum() and quadsum().
So, I think what we're seeing could be titled "multiprocessing is slow in long loops of small calculations".
It suggests that when using Stata/MP an early step in evaluating poor performance should be to run it on a single processor and see if your code is perhaps spending too much time preparing small tasks for multiprocessing. In theory, the following code should demonstrate this sort of comparison for your task, but regrettably I cannot test it myself.
Code:
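* assumes n, x, ixl, ixu, and y1 already exist in Mata, as in post #1
timer clear

* run 1: restrict Stata/MP to a single processor
set processors 1
timer on 1
mata
for (i = 1; i <= n; i++) {
    y1[i] = mean(x[|ixl[i] \ ixu[i]|])
}
end
timer off 1

* run 2: use every processor the license allows
set processors `=c(processors_lic)'
timer on 2
mata
for (i = 1; i <= n; i++) {
    y1[i] = mean(x[|ixl[i] \ ixu[i]|])
}
end
timer off 2

* display the elapsed time for both runs
timer list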
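If explanation (b) holds, one way to check it on your side is to time mean() against quadsum() divided by rows() on the same windows. The following is only a sketch, and like the code above I have not been able to run it under Stata/MP. The data are hypothetical stand-ins for post #1: the seed, the value of n (smaller than your 500,000 to keep the run short), and the way ixl and ixu are generated are my assumptions. It also assumes x has no missing values, since mean() omits missing rows while quadsum() counts them as zero.

```stata
mata:
// Hypothetical stand-in data; post #1's actual x, ixl, and ixu are not reproduced here
rseed(12345)
n   = 50000                                      // arbitrary, for a quick run
x   = rnormal(n + 200, 1, 0, 1)                  // long enough for the last window
ixl = (1::n)                                     // window start
ixu = ixl :+ 100 :+ floor(101 * runiform(n, 1))  // window length between 101 and 201

y1 = J(n, 1, .)
y2 = J(n, 1, .)

timer_clear()

// timer 1: mean() on each window
timer_on(1)
for (i = 1; i <= n; i++) {
    y1[i] = mean(x[|ixl[i] \ ixu[i]|])
}
timer_off(1)

// timer 2: the same mean computed as quadsum()/rows()
timer_on(2)
for (i = 1; i <= n; i++) {
    z = x[|ixl[i] \ ixu[i]|]
    y2[i] = quadsum(z) / rows(z)
}
timer_off(2)

timer()            // report both timers
mreldif(y1, y2)    // the two sets of results should agree closely
end
```

Running this once after set processors 1 and once with all licensed processors would show whether only the mean() loop slows down as processors are added.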