[Numpy-discussion] numpy ufuncs and COREPY - any info?

Andrew Friedley afriedle@indiana....
Tue May 26 08:14:39 CDT 2009

David Cournapeau wrote:
> Francesc Alted wrote:
>> Well, it is Andrew who should demonstrate that his measurement is correct, but 
>> in principle, 4 cycles/item *should* be feasible when using 8 cores in 
>> parallel.
> But the 100x speed increase is for one core only unless I misread the
> table. And I should have mentioned that 400 cycles/item for cos is on a
> Pentium 4, which has dreadful performance (defective L1). On a much
> better Core Duo Extreme something, I get 100 cycles/item (on a 64-bit
> machine, though, and not the same compiler, although I guess the libm
> version is what matters the most here).
> And let's not forget that there is the Python wrapping cost: by doing
> everything in C, I got ~200 cycles/cos on the PIV, and ~60 cycles/cos on
> the Core 2 Duo (for double), using the rdtsc performance counter. All
> this for 1024 items in the array, so a very optimistic use case (everything
> in L2 cache, if not L1).
> This shows that python wrapping cost is not so high, making the 100x
> claim a bit doubtful without more details on the way to measure speed.

I appreciate all the discussion this is creating.  I wish I could work 
on this more right now; I have a big paper deadline coming up June 1 
that I need to focus on.

Yes, you're reading the table right.  I should have been clearer about 
what my implementation is doing.  It uses SIMD, so it computes 4 
cosines at a time where a libm cosine computes only one.  Also, I don't 
think libm transcendentals are known for being fast; I'm also likely 
gaining performance by using a well-optimized but less accurate 
approximation.  In fact, a little more inspection shows that my accuracy 
decreases as the input values increase; I will probably need to take a 
performance hit to fix this.
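
To illustrate why accuracy can fall off as inputs grow, here is a minimal sketch. It does not use the CorePy approximation; as an assumption for illustration, a truncated Taylor series stands in for a fast polynomial approximation, and cos_poly and cos_reduced are hypothetical names. The polynomial is accurate near zero and degrades quickly for larger arguments, while first reducing the argument into [-pi, pi] restores accuracy at the cost of extra work per element.

```python
import math

def cos_poly(x, terms=8):
    """Approximate cos(x) with the first `terms` terms of its Taylor series.

    Hypothetical stand-in for a fast approximation, not the CorePy code.
    """
    total = 0.0
    term = 1.0
    for k in range(terms):
        total += term
        # Next term: multiply by -x^2 / ((2k+1)(2k+2)).
        term *= -x * x / ((2 * k + 1) * (2 * k + 2))
    return total

def cos_reduced(x, terms=8):
    """Reduce x into [-pi, pi] first, then apply the same polynomial."""
    r = math.remainder(x, 2.0 * math.pi)
    return cos_poly(r, terms)

if __name__ == "__main__":
    for x in (0.5, 3.0, 10.0, 30.0):
        err_raw = abs(cos_poly(x) - math.cos(x))
        err_red = abs(cos_reduced(x) - math.cos(x))
        print("x=%5.1f  raw error=%.3e  reduced error=%.3e" % (x, err_raw, err_red))
```

The reduction step is the kind of extra per-element work that would show up as a performance hit.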

I went and wrote code to use the libm fcos() routine instead of my cos 
code.  Performance is equivalent to numpy, plus an overhead:

inp sizes         1024      10240     102400    1024000    3072000
numpy           0.7282     9.6278   115.5976   993.5738  3017.3680

lmcos    1      0.7594     9.7579   116.7135  1039.5783  3156.8371
lmcos    2      0.5274     5.7885    61.8052   537.8451  1576.2057
lmcos    4      0.5172     5.1240    40.5018   313.2487   791.9730

corepy   1      0.0142     0.0880     0.9566     9.6162    28.4972
corepy   2      0.0342     0.0754     0.6991     6.1647    15.3545
corepy   4      0.0596     0.0963     0.5671     4.9499    13.8784
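
Since the quoted discussion is in cycles per item, it may help to convert the table entries to the same unit. The sketch below assumes the times are in milliseconds and the clock runs at a constant 2 GHz, both taken from this post; cycles_per_item is a hypothetical helper name, and the conversion ignores any frequency scaling.

```python
CLOCK_HZ = 2.0e9  # assumed constant 2 GHz clock

def cycles_per_item(time_ms, n_items, clock_hz=CLOCK_HZ):
    """Convert a wall-clock time in milliseconds to cycles per array element."""
    return time_ms * 1e-3 * clock_hz / n_items

if __name__ == "__main__":
    # Values copied from the 1024-element column of the table.
    for label, ms in (("numpy", 0.7282), ("lmcos 1", 0.7594), ("corepy 1", 0.0142)):
        print("%-9s %8.1f cycles/item" % (label, cycles_per_item(ms, 1024)))
```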

The times shown are in milliseconds; the system used is a dual-socket, 
dual-core 2 GHz Opteron.  I'm testing at the ufunc level, like this:

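import time

def benchmark(fn, args):
   # Prime the execution once so one-time setup costs are not timed.
   fn(*args)

   avgtime = 0
   for i in range(7):
     t1 = time.time()
     fn(*args)  # the call being timed
     t2 = time.time()

     tm = t2 - t1
     avgtime += tm

   # Average time per call, in seconds.
   return avgtime / 7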

Here fn is a ufunc, e.g. numpy.cos.  So I prime the execution once, then 
do 7 timings and take the average.  I always appreciate suggestions on 
better ways to benchmark things.
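
One common alternative, sketched here rather than taken from the thread, is to use the standard-library timeit module and report the minimum of several runs instead of the mean, since the minimum is less sensitive to interference from other processes. benchmark_min is a hypothetical name, and the pure-Python cosine loop in the usage example is only a stand-in for a ufunc call.

```python
import math
import timeit

def benchmark_min(fn, args, repeats=7):
    """Return the fastest of `repeats` timed calls to fn(*args), in seconds."""
    fn(*args)  # warm-up call, not timed
    timer = timeit.Timer(lambda: fn(*args))
    # number=1 means each measurement times a single call.
    times = timer.repeat(repeat=repeats, number=1)
    return min(times)

def cos_list(xs):
    """Stand-in workload: cosine of every element of a list."""
    return [math.cos(x) for x in xs]

if __name__ == "__main__":
    data = [0.001 * i for i in range(1024)]
    print("best of 7: %.6f ms" % (benchmark_min(cos_list, (data,)) * 1e3))
```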

