[Numpy-discussion] Need help for implementing a fast clip in numpy (was slow clip)

Francesc Altet faltet at carabos.com
Thu Jan 11 12:00:25 CST 2007


On Fri, 12 Jan 2007 at 00:58 +0900, David Cournapeau wrote:
> David Cournapeau wrote:
> > Francesc Altet wrote:
> >> On Wednesday 10 January 2007 at 22:49, Stefan van der Walt wrote:
> >>> On Wed, Jan 10, 2007 at 08:28:14PM +0100, Francesc Altet wrote:
> >>>> On Tue, 09 Jan 2007 at 23:19 +0900, David Cournapeau wrote:
> >>>> time (putmask)--> 1.38
> >>>> time (where)--> 2.713
> >>>> time (numexpr where)--> 1.291
> >>>> time (fancy+assign)--> 0.967
> >>>> time (numexpr clip)--> 0.596
> >>>>
> >>>> It is interesting to see there how fancy indexing + assignment is
> >>>> quite a bit more efficient than putmask.
> >>> Not on my machine:
> >>>
> >>> time (putmask)--> 0.181
> >>> time (where)--> 0.783
> >>> time (numexpr where)--> 0.26
> >>> time (fancy+assign)--> 0.202
> >>
> >> Yeah, a lot of difference indeed. Just for reference, my results
> >> above were obtained on a Duron (an Athlon, but with only 128 KB of
> >> secondary cache) at 0.9 GHz. Now, using my laptop (Intel Pentium 4 @
> >> 2 GHz, 512 KB of secondary cache), I get:
> >>
> >> time (putmask)--> 0.244
> >> time (where)--> 2.111
> >> time (numexpr where)--> 0.427
> >> time (fancy+assign)--> 0.316
> >> time (numexpr clip)--> 0.184
> >>
> >> so, on my laptop fancy+assign is noticeably slower than putmask. It
> >> should also be noted that the implementation of clip in numexpr (i.e.
> >> in pure C) is not that much faster than putmask (only about 30%); so
> >> perhaps it is not really necessary to come up with a pure C
> >> implementation of clip (or at least, not on Intel P4 machines!).
> >>
> >> In any case, it is really striking to see how differently the
> >> various CPU architectures perform on this apparently simple problem.
> > I am not sure it is such a simple problem: it involves massive branching.
> To be more precise, you can do clipping without branching, but then the
> clipping is highly type and machine dependent (using bit masks and other
> tricks). It may be worth the trouble for double, float and int, dunno.

Well, I don't know what this trick might be, but just for completeness I
have run the benchmark on an AMD Opteron (2 GHz, 1 MB of secondary
cache) machine, and here are the timings:

time (putmask)--> 0.5
time (where)--> 1.035
time (numexpr where)--> 0.263
time (fancy+assign)--> 0.311
time (numpy clip)--> 0.704
time (numexpr clip)--> 0.089

[Note that I've added numpy's own clip to the comparison. See the new
benchmark attached.]
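
In case the attachment gets mangled by the archive, the variants being
timed are roughly of the following form. This is only a minimal sketch:
the array size, dtype and clipping bounds are illustrative, the "numexpr
clip" entry relies on a clip kernel added to numexpr that is not shown
here, and the attached clip-bench2.py may differ in the details:

import time
import numpy
import numexpr

N = 1000 * 1000                 # illustrative array size
a = numpy.random.rand(N)        # random doubles in [0, 1)
amin, amax = 0.2, 0.8           # illustrative clipping bounds

def bench(label, func):
    b = a.copy()                # fresh copy so in-place variants don't interfere
    t0 = time.time()
    func(b)
    print("time (%s)--> %.3f" % (label, time.time() - t0))

def clip_putmask(b):
    # overwrite out-of-range values in place
    numpy.putmask(b, b < amin, amin)
    numpy.putmask(b, b > amax, amax)

def clip_where(b):
    # build a new array with nested selections
    return numpy.where(b < amin, amin, numpy.where(b > amax, amax, b))

def clip_fancy(b):
    # boolean ("fancy") indexing + assignment, in place
    b[b < amin] = amin
    b[b > amax] = amax

def clip_numpy(b):
    # numpy's own clip
    return b.clip(amin, amax)

def clip_numexpr(b):
    # numexpr evaluates the whole expression in its virtual machine,
    # avoiding the temporaries that the numpy.where version creates
    return numexpr.evaluate("where(b < amin, amin, where(b > amax, amax, b))")

bench("putmask", clip_putmask)
bench("where", clip_where)
bench("fancy+assign", clip_fancy)
bench("numpy clip", clip_numpy)
bench("numexpr where", clip_numexpr)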

Curiously enough, fancy+assign is again faster than putmask. Also
interesting is the fact that AMD seems to have optimized branching in the
Opteron quite a lot: the processor is only 2x faster in clock speed than
the Duron, and most of the benchmarks run 3x or 4x faster, but the numexpr
clip runs more than 6x faster (!).
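
Regarding the trick for clipping without branching, I can only guess what
David has in mind, but for integers the usual approach is the XOR/mask
selection, where the result of a comparison is turned into an all-zeros
or all-ones mask so that the selection needs no conditional jump. A
minimal sketch of the idea in Python (my guess only, and valid for
integers; floats would need different tricks):

def branchless_min(x, y):
    # -(x < y) is -1 (all bits set) when x < y, else 0, so the masked XOR
    # term either turns y into x or leaves y untouched
    return y ^ ((x ^ y) & -(x < y))

def branchless_max(x, y):
    return x ^ ((x ^ y) & -(x < y))

def branchless_clip(x, lo, hi):
    return branchless_max(branchless_min(x, hi), lo)

assert branchless_clip(5, 0, 10) == 5
assert branchless_clip(-3, 0, 10) == 0
assert branchless_clip(42, 0, 10) == 10

In a C implementation for numpy this would have to be spelled out per
element and per integer type, which is probably why David says it is so
type and machine dependent.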

-- 
Francesc Altet    |  Be careful about using the following code --
Carabos Coop. V.  |  I've only proven that it works, 
www.carabos.com   |  I haven't tested it. -- Donald Knuth
-------------- next part --------------
A non-text attachment was scrubbed...
Name: clip-bench2.py
Type: text/x-python
Size: 1048 bytes
Desc: not available
URL: http://projects.scipy.org/pipermail/numpy-discussion/attachments/20070111/2f0462f6/attachment.py

