[Numpy-discussion] numpy ufuncs and COREPY - any info?

David Cournapeau david@ar.media.kyoto-u.ac...
Tue May 26 01:58:52 CDT 2009

Francesc Alted wrote:
> Well, it is Andrew who should demonstrate that his measurement is correct, but 
> in principle, 4 cycles/item *should* be feasible when using 8 cores in 
> parallel.

But the 100x speed increase is for one core only, unless I misread the
table. And I should have mentioned that the 400 cycles/item for cos is on a
Pentium 4, which has dreadful performance (defective L1). On a much
better Core Duo Extreme something, I get 100 cycles/item (on a 64-bit
machine, though, and not the same compiler, although I guess the libm
version is what matters the most here).

And let's not forget that there is the Python wrapping cost: by doing
everything in C, I got ~200 cycles/cos on the PIV, and ~60 cycles/cos on
the Core 2 Duo (for double), using the rdtsc performance counter. All
this is for 1024 items in the array, so a very optimistic use case
(everything fits in L2 cache, if not L1).
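
For anyone who wants to get a rough per-item figure of their own, here is a
minimal sketch in Python. It is not the rdtsc-based C harness used for the
numbers above: it times with `time.perf_counter_ns` and converts to cycles
using an assumed clock frequency (`ASSUMED_CLOCK_HZ` is a placeholder to
adjust for your machine). The helper names `cycles_per_item` and `cos_all`
are hypothetical, and `math.cos` over a list is used only so that the sketch
runs without NumPy, which means the result includes interpreter overhead for
every call.

```python
import math
import time

# Assumption: nominal CPU frequency in Hz; set this for your own machine.
ASSUMED_CLOCK_HZ = 2.4e9


def cycles_per_item(func, data, clock_hz=ASSUMED_CLOCK_HZ, repeats=200):
    """Estimate the cycles spent per item when calling func(data).

    The minimum wall-clock time over `repeats` runs is used to reduce
    noise from the scheduler and from cold caches.
    """
    best_ns = None
    for _ in range(repeats):
        start = time.perf_counter_ns()
        func(data)
        elapsed = time.perf_counter_ns() - start
        if best_ns is None or elapsed < best_ns:
            best_ns = elapsed
    seconds_per_item = best_ns * 1e-9 / len(data)
    return seconds_per_item * clock_hz


def cos_all(values):
    """Apply math.cos to every item (pure-Python stand-in for a ufunc)."""
    return [math.cos(v) for v in values]


if __name__ == "__main__":
    # 1024 doubles, as in the measurement above, so everything stays in cache.
    data = [i * 0.001 for i in range(1024)]
    print("approx. cycles/item:", cycles_per_item(cos_all, data))
```

To time a NumPy ufunc instead, one would pass the ufunc and an array of 1024
doubles in place of `cos_all` and `data`. The conversion to cycles is only as
good as the assumed frequency, so treat the result as an order of magnitude.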

This shows that the Python wrapping cost is not so high (roughly a factor
of two at most), which makes the 100x claim a bit doubtful without more
details on how the speed was measured.
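
The arithmetic behind that conclusion can be sketched as follows. The
cycle counts are the approximate figures quoted in this message; the sketch
assumes that the 400 and 100 cycles/item numbers were measured through the
Python wrapper, and that the two Core figures refer to the same machine. The
names `wrapping_ratio` and `implied_cycles` are hypothetical helpers.

```python
# Approximate cycles per cos() call quoted in this thread.
# Assumption: the "wrapped" figures were measured through Python.
WRAPPED = {"Pentium 4": 400.0, "Core 2 Duo": 100.0}
PLAIN_C = {"Pentium 4": 200.0, "Core 2 Duo": 60.0}


def wrapping_ratio(cpu):
    """How many times slower the wrapped call is than the plain C loop."""
    return WRAPPED[cpu] / PLAIN_C[cpu]


def implied_cycles(cpu, speedup=100.0):
    """Cycles/item a given speedup over the wrapped figure would imply."""
    return WRAPPED[cpu] / speedup


if __name__ == "__main__":
    for cpu in WRAPPED:
        print(cpu,
              "wrapping ratio: %.2f" % wrapping_ratio(cpu),
              "cycles/item implied by 100x: %.1f" % implied_cycles(cpu))
```

Under those assumptions the wrapper accounts for a factor of at most 2, so
removing it cannot explain a 100x gain, and a 100x gain on a single core
would mean between 1 and 4 cycles per cos.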



