[Numpy-discussion] Vectorizing code, for loops, and all that
oliphant at ee.byu.edu
Mon Oct 2 19:32:26 CDT 2006
Travis Oliphant wrote:
>I suspect I know why, although the difference seems rather large.
>I'm surprised the overhead of adjusting pointers is so high, but then
>again you are probably getting a lot of cache misses in the first case
>so there is more to it than that, the loops may run more slowly too.
I'm personally bothered that this example runs so much more slowly. I
don't think it should. Perhaps it is unavoidable because of the
memory-layout issues. It is just hard to believe that the overhead for
calling into the loop and adjusting the pointers is so much higher.
But that isn't the problem here. Notice the following:
import numpy as N
x3 = N.random.rand(39,2000)
x4 = N.random.rand(39,64,1)
%timeit z3 = x3[:,None,:] - x4
10 loops, best of 3: 76.4 ms per loop
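A minimal sketch (not from the post) that reproduces this timing outside of IPython, using `timeit` directly; the shapes are taken from the example above, and `np` stands in for the post's `N` alias:

```python
import timeit
import numpy as np

# Shapes from the example above.
x3 = np.random.rand(39, 2000)
x4 = np.random.rand(39, 64, 1)

# Broadcast subtraction: (39, 1, 2000) - (39, 64, 1) -> (39, 64, 2000).
z3 = x3[:, None, :] - x4
print(z3.shape)  # (39, 64, 2000)

# Time the broadcast subtraction, roughly matching %timeit's report.
t = timeit.timeit(lambda: x3[:, None, :] - x4, number=10)
print(f"{t / 10 * 1e3:.1f} ms per loop")
```

The absolute numbers will of course differ on modern hardware and NumPy versions; the point is only that the broadcast result is a large (39, 64, 2000) array whose construction is dominated by memory traffic.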
Hmm... It looks like cache misses are a lot more important than making
sure the inner loop is taken over the largest number of variables
(that's the current way ufuncs decide which axis ought to be used as the
1-d loop).
Perhaps those inner 1-d loops could be optimized (using prefetch or
something) to reduce the number of cache misses in the inner
computation, and the concept of looping over the largest dimension
(instead of the last dimension) should be reconsidered.
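A small illustration (my own sketch, not from the post) of why the choice of loop axis interacts with memory layout: reducing over the axis that is contiguous in memory streams through the buffer sequentially, while reducing over a strided axis of the same array incurs far more cache misses. Here `asfortranarray` is used to produce the same data with the opposite layout:

```python
import timeit
import numpy as np

# Same values, two memory layouts: C-order (last axis contiguous)
# and Fortran-order (first axis contiguous).
a = np.random.rand(39, 64, 2000)
b = np.asfortranarray(a)

# Summing over axis 2 walks contiguous memory for `a`,
# but strided memory for `b` -- typically measurably slower.
t_c = timeit.timeit(lambda: a.sum(axis=2), number=20)
t_f = timeit.timeit(lambda: b.sum(axis=2), number=20)
print(f"C-order, sum over last axis: {t_c / 20 * 1e3:.2f} ms per loop")
print(f"F-order, sum over last axis: {t_f / 20 * 1e3:.2f} ms per loop")
```

The results are numerically identical either way; only the traversal order, and hence the cache behavior, differs. This is the same trade-off the post raises for ufunc loop-axis selection.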