[Numpy-discussion] numpy ufuncs and COREPY - any info?
Andrew Friedley
afriedle@indiana....
Tue May 26 08:14:39 CDT 2009
David Cournapeau wrote:
> Francesc Alted wrote:
>> Well, it is Andrew who should demonstrate that his measurement is correct, but
>> in principle, 4 cycles/item *should* be feasible when using 8 cores in
>> parallel.
>
> But the 100x speed increase is for one core only unless I misread the
> table. And I should have mentioned that 400 cycles/item for cos is on a
> pentium 4, which has dreadful performances (defective L1). On a much
> better core duo extreme something, I get 100 cycles / item (on a 64 bits
> machines, though, and not same compiler, although I guess the libm
> version is what matters the most here).
>
> And let's not forget that there is the python wrapping cost: by doing
> everything in C, I got ~ 200 cycle/cos on the PIV, and ~60 cycles/cos on
> the core 2 duo (for double), using the rdtsc performance counter. All
> this for 1024 items in the array, so very optimistic usecase (everything
> in cache 2 if not 1).
>
> This shows that python wrapping cost is not so high, making the 100x
> claim a bit doubtful without more details on the way to measure speed.
I appreciate all the discussion this is creating. I wish I could work
on this more right now; I have a big paper deadline coming up June 1
that I need to focus on.
Yes, you're reading the table right. I should have been more clear on
what my implementation is doing. It's using SIMD, so performing 4
cosine's at a time where a libm cosine is only doing one. Also I don't
think libm trancendentals are known for being fast; I'm also likely
gaining performance by using a well-optimized but less accurate
approximation. In fact a little more inspection shows my accuracy
decreases as the input values increase; I will probably need to take a
performance hit to fix this.
I went and wrote code to use the libm fcos() routine instead of my cos
code. Performance is equivalent to numpy, plus an overhead:
inp sizes 1024 10240 102400 1024000 3072000
numpy 0.7282 9.6278 115.5976 993.5738 3017.3680
lmcos 1 0.7594 9.7579 116.7135 1039.5783 3156.8371
lmcos 2 0.5274 5.7885 61.8052 537.8451 1576.2057
lmcos 4 0.5172 5.1240 40.5018 313.2487 791.9730
corepy 1 0.0142 0.0880 0.9566 9.6162 28.4972
corepy 2 0.0342 0.0754 0.6991 6.1647 15.3545
corepy 4 0.0596 0.0963 0.5671 4.9499 13.8784
The times I show are in milliseconds; the system used is a dual-socket
dual-core 2ghz opteron. I'm testing at the ufunc level, like this:
def benchmark(fn, args):
avgtime = 0
fn(*args)
for i in xrange(7):
t1 = time.time()
fn(*args)
t2 = time.time()
tm = t2 - t1
avgtime += tm
return avgtime / 7
Where fn is a ufunc, ie numpy.cos. So I prime the execution once, then
do 7 timings and take the average. I always appreciate suggestions on
better way to benchmark things.
Andrew
More information about the Numpy-discussion
mailing list