[Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

Gregor Thalhammer gregor.thalhammer@gmail....
Fri Jan 16 13:35:37 CST 2009

Francesc Alted schrieb:
> A Friday 16 January 2009, Gregor Thalhammer escrigué:
>> I also gave a try to the vector math library (VML), contained in
>> Intel's Math Kernel Library. This offers a fast implementation of
>> mathematical functions, operating on array. First I implemented a C
>> extension, providing new ufuncs. This gave me a big performance gain,
>> e.g., 2.3x (5x) for sin, 6x (10x) for exp, 7x (15x) for pow, and 3x
>> (6x) for division (no gain for add, sub, mul).
> Wow, pretty nice speed-ups indeed!  In fact I was thinking in including 
> support for threading in Numexpr (I don't think it would be too 
> difficult, but let's see).  BTW, do you know how VML is able to achieve 
> a speedup of 6x for a sin() function?  I suppose this is because they 
> are using SSE instructions, but, are these also available for 64-bit 
> double precision items?
I am not an expert on SSE instructions, but to my knowledge there exist 
(in the Core 2 architecture) no SSE instruction to calculate the sin. 
But it seems to be possible to (approximately) calculate a sin with a 
couple of multiplication/ addition instructions (and they exist in SSE 
for 64-bit float). Intel (and AMD) seems to use a more clever algorithm, 
efficiently implemented than the standard implementation.
> Well, if you can provide the code, I'd be glad to include it in numexpr.  
> The only requirement is that the VML must be optional during the build 
> of the package.
Yes, I will try to provide you with a polished version of my changes, 
making them optional.
>> There is one but: VML supports (at the moment) only math 
>> on contiguous arrays. At a first try I didn't understand how to
>> enforce this limitation in numexpr.
> No problem.  At the end of the numexpr/necompiler.py you will see some 
> code like:
>         # All the opcodes can deal with strided arrays directly as
>         # long as they are undimensional (strides in other
>         # dimensions are dealt within the extension), so we don't
>         # need a copy for the strided case.
>         if not b.flags.aligned:
>            ...
> which you can replace with something like:
>         # need a copy for the strided case.
>         if VML_available and not b.flags.contiguous: 
> 	    b = b.copy()
>         elif not b.flags.aligned:
> 	    ...
> That would be enough for ensuring that all the arrays are contiguous 
> when they hit numexpr's virtual machine.
Ah I see, that's not difficult. I thought copying is done in the virtual 
machine. (didn't read all the code ...)
> Being said this, it is a shame that VML does not have support for 
> strided/unaligned arrays.  They are quite common beasts, specially when 
> you work with heterogeneous arrays (aka record arrays).
I have the impression that you can already feel happy if these 
mathematical libraries support a C interface, not only Fortran. At least 
the Intel VML provides functions to pack/unpack strided arrays which 
seem work on a broader parameter range than specified (also zero or 
negative step sizes).
>> I also gave a quick try to the 
>> equivalent vector math library, acml_mv of AMD. I only tried sin and
>> log, gave me the same performance (on a Intel processor!) like Intels
>> VML .
>> I was also playing around with the block size in numexpr. What are
>> the rationale that led to the current block size of 128? Especially
>> with VML, a larger block size of 4096 instead of 128 allowed  to
>> efficiently use multithreading in VML.
> Experimentation.  Back in 2006 David found that 128 was optimal for the 
> processors available by that time.  With the Numexpr 1.1 my experiments 
> show that 256 is a better value for current Core2 processors and most 
> expressions in our benchmark bed (see benchmarks/ directory); hence, 
> 256 is the new value for the chunksize in 1.1.  However, be in mind 
> that 256 has to be multiplied by the itemsize of each array, so the 
> chunksize is currently 2048 bytes for 64-bit items (int64 or float64) 
> and 4096 for double precision complex arrays, which are probably the 
> sizes that have to be compared with VML.
So the optimum block size  might depend  on the type of expression and 
if VML functions are used. On question: the block size is set by a 
#define, is there a significantly poorer performance if you use a 
variable instead? Would be more flexible, especially for testing and tuning.
>> I was missing the support for single precision floats.
> Yeah.  This is because nobody has implemented it before, but it is 
> completely doable.

More information about the Numpy-discussion mailing list