[Numpy-discussion] numpy ufuncs and COREPY - any info?
Thu May 28 01:44:48 CDT 2009
On Tuesday 26 May 2009 15:14:39, Andrew Friedley wrote:
> David Cournapeau wrote:
> > Francesc Alted wrote:
> >> Well, it is Andrew who should demonstrate that his measurement is
> >> correct, but in principle, 4 cycles/item *should* be feasible when using
> >> 8 cores in parallel.
> > But the 100x speed increase is for one core only unless I misread the
> > table. And I should have mentioned that 400 cycles/item for cos is on a
> > Pentium 4, which has dreadful performance (defective L1). On a much
> > better Core Duo Extreme something, I get 100 cycles/item (on a 64-bit
> > machine, though, and not the same compiler, although I guess the libm
> > version is what matters the most here).
> > And let's not forget that there is the Python wrapping cost: by doing
> > everything in C, I got ~200 cycles/cos on the P4, and ~60 cycles/cos on
> > the Core 2 Duo (for double), using the rdtsc performance counter. All
> > this for 1024 items in the array, so a very optimistic use case
> > (everything in L2 cache if not L1).
> > This shows that the Python wrapping cost is not so high, making the 100x
> > claim a bit doubtful without more details on how the speed was measured.
> I appreciate all the discussion this is creating. I wish I could work
> on this more right now; I have a big paper deadline coming up June 1
> that I need to focus on.
> Yes, you're reading the table right. I should have been more clear on
> what my implementation is doing. It's using SIMD, so performing 4
> cosines at a time where a libm cosine is only doing one. Also I don't
> think libm transcendentals are known for being fast; I'm also likely
> gaining performance by using a well-optimized but less accurate
> approximation. In fact a little more inspection shows my accuracy
> decreases as the input values increase; I will probably need to take a
> performance hit to fix this.
> I went and wrote code to use the libm fcos() routine instead of my cos
> code. Performance is equivalent to numpy, plus an overhead:
> inp sizes       1024      10240     102400    1024000    3072000
> numpy         0.7282     9.6278   115.5976   993.5738  3017.3680
> lmcos 1       0.7594     9.7579   116.7135  1039.5783  3156.8371
> lmcos 2       0.5274     5.7885    61.8052   537.8451  1576.2057
> lmcos 4       0.5172     5.1240    40.5018   313.2487   791.9730
> corepy 1      0.0142     0.0880     0.9566     9.6162    28.4972
> corepy 2      0.0342     0.0754     0.6991     6.1647    15.3545
> corepy 4      0.0596     0.0963     0.5671     4.9499    13.8784
> The times I show are in milliseconds; the system used is a dual-socket
> dual-core 2 GHz Opteron. I'm testing at the ufunc level, like this:
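> import time
>
> def benchmark(fn, args):
>     fn(*args)  # prime the execution once before timing
>     avgtime = 0
>     for i in range(7):
>         t1 = time.time()
>         fn(*args)  # the call being timed
>         t2 = time.time()
>         tm = t2 - t1
>         avgtime += tm
>     return avgtime / 7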
> Where fn is a ufunc, i.e. numpy.cos. So I prime the execution once, then
> do 7 timings and take the average. I always appreciate suggestions on
> better ways to benchmark things.
No, that seems good enough. But maybe you can present results in cycles/item.
This is a relatively common unit and has the advantage that it does not depend
on the frequency of your cores.
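
For illustration, the conversion could be sketched as follows. This is only a rough sketch: the helper name cycles_per_item is made up here, the 2 GHz clock is taken from the system description quoted above, and the timings are copied from the "numpy" and "corepy 1" rows of the table. For the multi-threaded rows the same formula would give wall-clock cycles per item, not cycles per core.

```python
# Hypothetical helper (not part of numpy or corepy): convert a wall-clock
# timing for a whole array into cycles per item.
def cycles_per_item(time_ms, n_items, clock_hz=2.0e9):
    # clock_hz defaults to 2 GHz, an assumption based on the quoted
    # "2 GHz Opteron" system description.
    seconds = time_ms * 1e-3
    return seconds * clock_hz / n_items

# Input sizes and single-threaded timings (milliseconds) from the table.
sizes = [1024, 10240, 102400, 1024000, 3072000]
numpy_ms = [0.7282, 9.6278, 115.5976, 993.5738, 3017.3680]
corepy1_ms = [0.0142, 0.0880, 0.9566, 9.6162, 28.4972]

for n, t_np, t_cp in zip(sizes, numpy_ms, corepy1_ms):
    print("%8d  numpy: %8.1f  corepy 1: %6.1f"
          % (n, cycles_per_item(t_np, n), cycles_per_item(t_cp, n)))
```

Dividing by the item count also makes runs with different array sizes directly comparable, which the raw millisecond figures are not.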