[Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

Francesc Alted faltet@pytables....
Tue Jan 20 08:13:01 CST 2009

On Tuesday 20 January 2009, Andrew Collette wrote:
> Hi Francesc,
> Looks like a cool project!  However, I'm not able to achieve the
> advertised speed-ups.  I wrote a simple script to try three
> approaches to this kind of problem:
> 1) Native Python code (i.e. will try to do everything at once using
> temp arrays)
> 2) Straightforward numexpr evaluation
> 3) Simple "chunked" evaluation using array.flat views.  (This solves
> the memory problem and allows the use of arbitrary Python
> expressions).
> I've attached the script; here's the output for the expression
> "63 + (a*b) + (c**2) + sin(b)"
> along with a few combinations of shapes/dtypes.  As expected, using
> anything other than "f8" (double) results in a performance penalty.
> Surprisingly, it seems that using chunks via array.flat results in
> similar performance for f8, and even better performance for other
> dtypes.

Well, there were two issues there.  The first one is that when
transcendental functions are used (like sin() above), the bottleneck is
the CPU rather than memory bandwidth, so numexpr speed-ups are not as
high as usual.  The other issue was an actual bug in the numexpr code
that forced a copy of all multidimensional arrays (I normally only use
unidimensional arrays for benchmarks).  This has been fixed in
trunk (r39).
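
For readers without the attached script, here is a rough sketch of the three
approaches being compared.  It is not Andrew's script: the function names, the
chunk size and the use of reshape(-1) views in place of array.flat are
assumptions made for illustration, and numexpr is treated as optional.

```python
import numpy as np

try:
    import numexpr as ne  # optional; only needed for the numexpr variant
except ImportError:
    ne = None


def simple(a, b, c):
    # Plain NumPy: every operator allocates a full-size temporary array.
    return 63 + (a * b) + (c ** 2) + np.sin(b)


def chunked(a, b, c, chunk=4096):
    # Evaluate block by block over flat views so the temporaries stay small.
    # Assumes C-contiguous inputs, so reshape(-1) gives a view, not a copy.
    out = np.empty_like(a)
    fa, fb, fc, fo = (x.reshape(-1) for x in (a, b, c, out))
    for start in range(0, fo.size, chunk):
        s = slice(start, start + chunk)
        fo[s] = 63 + (fa[s] * fb[s]) + (fc[s] ** 2) + np.sin(fb[s])
    return out


def with_numexpr(a, b, c):
    # numexpr looks up a, b and c in the caller's frame and evaluates the
    # expression string block by block.
    return ne.evaluate("63 + (a*b) + (c**2) + sin(b)")


shape, dtype = (100, 100, 100), "f8"
rng = np.random.default_rng(0)
a, b, c = (rng.random(shape).astype(dtype) for _ in range(3))
```

Wrapping each call in timeit and averaging over 10 runs should give figures
comparable in spirit to the ones in this thread, although absolute timings
depend on the machine and on the numexpr build.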

So, with the fix on, the timings are:

(100, 100, 100) f4 (average of 10 runs)
Simple:  0.0426136016846
Numexpr:  0.11350851059
Chunked:  0.0635252952576
(100, 100, 100) f8 (average of 10 runs)
Simple:  0.119254398346
Numexpr:  0.10092959404
Chunked:  0.128384995461

The speed-up is now a mere 20% (for f8), but at least it is not slower.
With the patches that Georg recently contributed for using Intel's VML,
the acceleration is a bit better:

(100, 100, 100) f4 (average of 10 runs)
Simple:  0.0417867898941
Numexpr:  0.0944641113281
Chunked:  0.0636183023453
(100, 100, 100) f8 (average of 10 runs)
Simple:  0.120059680939
Numexpr:  0.0832288980484
Chunked:  0.128114104271

i.e. the speed-up is around 45% (for f8).

Moreover, if I get rid of the sin() function and use the expression:

"63 + (a*b) + (c**2) + b"

I get:

(100, 100, 100) f4 (average of 10 runs)
Simple:  0.0119329929352
Numexpr:  0.0198570966721
Chunked:  0.0338240146637
(100, 100, 100) f8 (average of 10 runs)
Simple:  0.0255623102188
Numexpr:  0.00832500457764
Chunked:  0.0340095996857

which has a 3.1x speedup (for f8).
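
The speed-up figures quoted in this message are the Simple timing divided by
the Numexpr timing for f8.  A small sketch of that arithmetic, using the
timings reported above (the dictionary labels and the helper name are only
illustrative):

```python
# f8 timings in seconds, copied from the runs reported in this message:
# (Simple, Numexpr)
timings = {
    "sin, trunk fix": (0.119254398346, 0.10092959404),
    "sin, VML patches": (0.120059680939, 0.0832288980484),
    "no sin": (0.0255623102188, 0.00832500457764),
}


def speedup(simple, numexpr):
    # Ratio > 1 means numexpr is faster than the plain NumPy expression.
    return simple / numexpr


for label, (s, n) in timings.items():
    print(f"{label}: {speedup(s, n):.2f}x")
```

The ratios land close to the rounded figures given in the text.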

> FYI, the current tar file (1.1-1) has a glitch related to the VERSION
> file; I added to the bug report at google code.

Thanks.  I will look into that asap.  Mmm, it seems there is enough material
for another release of numexpr.  I'll try to do it soon.


Francesc Alted
