[Numpy-discussion] Vectorizing code, for loops, and all that

Travis Oliphant oliphant at ee.byu.edu
Mon Oct 2 19:32:26 CDT 2006

Travis Oliphant wrote:

>I suspect I know why, although the difference seems rather large.  

>I'm surprised the overhead of adjusting pointers is so high, but then 
>again you are probably getting a lot of cache misses in the first case 
>so there is more to it than that, the loops may run more slowly too.

I'm personally bothered that this example runs so much more slowly.  I 
don't think it should.  Perhaps it is unavoidable because of the 
memory-layout issues, but it is hard to believe that the overhead of 
calling into the loop and adjusting the pointers is that much higher. 

But that isn't the problem here.  Notice the following:

x3 = N.random.rand(39,2000)
x4 = N.random.rand(39,64,1)

%timeit z3 = x3[:,None,:] - x4

10 loops, best of 3: 76.4 ms per loop
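For reference, a minimal sketch of what this broadcast produces (values are random and only the shapes matter; this assumes NumPy's standard broadcasting rules, with `N` above being the usual `numpy` alias):

```python
import numpy as np

# Same shapes as in the timing above.
x3 = np.random.rand(39, 2000)
x4 = np.random.rand(39, 64, 1)

# x3[:, None, :] has shape (39, 1, 2000); broadcasting it against
# (39, 64, 1) expands both singleton axes, giving a (39, 64, 2000) result.
z3 = x3[:, None, :] - x4
print(z3.shape)  # (39, 64, 2000)
```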

Hmm... It looks like cache misses are a lot more important than making 
sure the inner loop runs over the largest number of elements (that's 
how ufuncs currently decide which axis ought to be used as the 
1-d loop). 

Perhaps those inner 1-d loops could be optimized (using prefetch or 
something) to reduce the number of cache misses on the inner 
computation, and the concept of looping over the largest dimension 
(instead of the last dimension) should be re-considered.
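One way to see why the choice of inner axis interacts with the cache is to look at the strides of the two broadcast operands.  A sketch using `np.broadcast_arrays` (a convenience API in later NumPy releases, used here purely for illustration):

```python
import numpy as np

x3 = np.random.rand(39, 2000)
x4 = np.random.rand(39, 64, 1)

# View both operands at the common (39, 64, 2000) shape without copying.
a, b = np.broadcast_arrays(x3[:, None, :], x4)

# Strides are in bytes; a stride of 0 means that axis repeats the same data.
print(a.strides)  # (16000, 0, 8) -- contiguous along the last axis
print(b.strides)  # (512, 8, 0)   -- constant along the last axis

# A 1-d loop over the last axis reads a's memory sequentially (stride 8)
# while holding b's element fixed (stride 0), which is the cache-friendly
# access pattern; looping over a different axis would stride through
# memory in larger jumps on every iteration.
```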


