[Numpy-discussion] numexpr efficency depends on the size of the computing kernel

Francesc Altet faltet@carabos....
Thu Apr 26 07:19:16 CDT 2007

Just a quick followup about this issue:

After a bit of investigation, I discovered that the difference in 
performance between the original numexpr and its PyTables counterpart (see 
the message below) was due *only* to the different compilation flags used 
(and not to an instruction-cache overload in the CPU).

It turns out that the original numexpr always adds the '-O2 -funroll-all-loops' 
flags (GCC compiler), while I compiled the PyTables instance with the Python 
default (-O3).  After recompiling the latter with the same flags as the 
original numexpr, I get exactly the same results from either version of 
numexpr, even on a processor with a secondary cache as small as 64 KB (AMD 
Duron) (i.e. the '-funroll-all-loops' flag seems to be *very* effective for 
optimizing the computing kernel of numexpr, at least on CPUs with small 
secondary caches).
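For reference, this is roughly how one would rebuild with matching flags. 
The exact build invocation is an assumption and may differ across 
numexpr/PyTables versions:

```shell
# Assumption: building in a numexpr (or PyTables) source tree whose
# distutils setup.py honors CFLAGS.  These are the flags the original
# numexpr package uses.
export CFLAGS="-O2 -funroll-all-loops"
python setup.py build_ext --inplace
```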

So, at least, this leads to the conclusion that numexpr's virtual machine 
is still far from getting overloaded, especially with today's processors 
sporting 512 KB of secondary cache or more.


On Wednesday, 14 March 2007, at 22:05, Francesc Altet wrote:
> Hi,
> Now that I have my old AMD Duron machine at my command, I've run some
> benchmarks to prove that numexpr's performance is not influenced by
> the size of the CPU cache, but I failed miserably (and Tim was right:
> numexpr efficiency does depend on the CPU cache size).
> Given that the PyTables instance of the numexpr computing kernel is
> considerably larger (it supports more datatypes) than the original,
> comparing the performance of both versions is a good way to check
> the influence of the CPU cache on computing efficiency.
> The attached benchmark is a small modification of the timing.py that
> comes with the numexpr package (the modification was needed to let the
> PyTables version of numexpr run all the cases). Basically, the tested
> expressions operate on arrays of 1 million elements, with a mix of
> contiguous and strided arrays (no unaligned arrays are present here).
> See the benchmark code for the details.
> The speed-ups of numexpr over plain numpy on an AMD Duron machine (64 +
> 64 KB L1 cache, 64 KB L2 cache) are:
> For the original numexpr package:
> 2.14, 2.21, 2.21  (these represent averages for 3 complete runs)
> For the modified pytables version (enlarged computing kernel):
> 1.32, 1.34, 1.37
> So, with a CPU with a very small cache, the original numexpr kernel is
> 1.6x faster than the PyTables one.
> However, on an AMD Opteron, which has a much bigger L2 cache (64 + 64
> KB L1 cache, 1 MB L2 cache), the speed-ups are quite similar:
> For the original numexpr package:
> 3.10, 3.35, 3.35
> For the modified pytables version (enlarged computing kernel):
> 3.37, 3.50, 3.45
> So, there is indeed a dependency on the CPU cache size. It would be
> nice to run the benchmark on other CPUs with an L2 cache in the range
> between 64 KB and 1 MB so as to find the point where the performance
> of the two versions becomes similar (this would give a good estimate
> of the size of the computing kernel).
> Meanwhile, the lesson learned is that Tim's worries were justified: one
> should be very careful about adding more opcodes (at least while CPUs
> with a very small L2 cache are in use).  With this, perhaps we will have
> to reduce the opcodes in the PyTables version of numexpr to a bare
> minimum :-/
> Cheers,
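The kind of timing comparison described in the quoted message can be 
sketched as follows.  This is a minimal illustration, not the actual 
timing.py script: the expression and the coefficients are my own choices, 
and only one contiguous and one strided operand are used.

```python
# Minimal sketch of a numexpr-vs-numpy timing comparison over
# ~1-million-element arrays, mixing a contiguous and a strided operand.
import timeit

import numpy as np
import numexpr as ne

n = 1_000_000
a = np.arange(n, dtype=np.float64)           # contiguous array
b = np.arange(2 * n, dtype=np.float64)[::2]  # strided view (every other element)

def with_numpy():
    # Plain numpy: each operation allocates an intermediate temporary.
    return 0.25 * a + 0.75 * b - 1.5

def with_numexpr():
    # numexpr: the whole expression is evaluated by its virtual machine,
    # block by block, avoiding large temporaries.
    return ne.evaluate("0.25*a + 0.75*b - 1.5")

# Sanity check: both paths must compute the same values.
assert np.allclose(with_numpy(), with_numexpr())

t_np = min(timeit.repeat(with_numpy, number=10, repeat=3))
t_ne = min(timeit.repeat(with_numexpr, number=10, repeat=3))
print(f"speed-up of numexpr over numpy: {t_np / t_ne:.2f}x")
```

The reported speed-up will of course vary with the CPU (and, as the 
message above shows, with the cache size and the compilation flags).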

>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
