[Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator
Fri Jan 16 13:35:37 CST 2009
Francesc Alted schrieb:
> A Friday 16 January 2009, Gregor Thalhammer escrigué:
>> I also gave a try to the vector math library (VML), contained in
>> Intel's Math Kernel Library. This offers a fast implementation of
>> mathematical functions, operating on array. First I implemented a C
>> extension, providing new ufuncs. This gave me a big performance gain,
>> e.g., 2.3x (5x) for sin, 6x (10x) for exp, 7x (15x) for pow, and 3x
>> (6x) for division (no gain for add, sub, mul).
> Wow, pretty nice speed-ups indeed! In fact I was thinking in including
> support for threading in Numexpr (I don't think it would be too
> difficult, but let's see). BTW, do you know how VML is able to achieve
> a speedup of 6x for a sin() function? I suppose this is because they
> are using SSE instructions, but, are these also available for 64-bit
> double precision items?
I am not an expert on SSE instructions, but to my knowledge there exist
(in the Core 2 architecture) no SSE instruction to calculate the sin.
But it seems to be possible to (approximately) calculate a sin with a
couple of multiplication/ addition instructions (and they exist in SSE
for 64-bit float). Intel (and AMD) seems to use a more clever algorithm,
efficiently implemented than the standard implementation.
> Well, if you can provide the code, I'd be glad to include it in numexpr.
> The only requirement is that the VML must be optional during the build
> of the package.
Yes, I will try to provide you with a polished version of my changes,
making them optional.
>> There is one but: VML supports (at the moment) only math
>> on contiguous arrays. At a first try I didn't understand how to
>> enforce this limitation in numexpr.
> No problem. At the end of the numexpr/necompiler.py you will see some
> code like:
> # All the opcodes can deal with strided arrays directly as
> # long as they are undimensional (strides in other
> # dimensions are dealt within the extension), so we don't
> # need a copy for the strided case.
> if not b.flags.aligned:
> which you can replace with something like:
> # need a copy for the strided case.
> if VML_available and not b.flags.contiguous:
> b = b.copy()
> elif not b.flags.aligned:
> That would be enough for ensuring that all the arrays are contiguous
> when they hit numexpr's virtual machine.
Ah I see, that's not difficult. I thought copying is done in the virtual
machine. (didn't read all the code ...)
> Being said this, it is a shame that VML does not have support for
> strided/unaligned arrays. They are quite common beasts, specially when
> you work with heterogeneous arrays (aka record arrays).
I have the impression that you can already feel happy if these
mathematical libraries support a C interface, not only Fortran. At least
the Intel VML provides functions to pack/unpack strided arrays which
seem work on a broader parameter range than specified (also zero or
negative step sizes).
>> I also gave a quick try to the
>> equivalent vector math library, acml_mv of AMD. I only tried sin and
>> log, gave me the same performance (on a Intel processor!) like Intels
>> VML .
>> I was also playing around with the block size in numexpr. What are
>> the rationale that led to the current block size of 128? Especially
>> with VML, a larger block size of 4096 instead of 128 allowed to
>> efficiently use multithreading in VML.
> Experimentation. Back in 2006 David found that 128 was optimal for the
> processors available by that time. With the Numexpr 1.1 my experiments
> show that 256 is a better value for current Core2 processors and most
> expressions in our benchmark bed (see benchmarks/ directory); hence,
> 256 is the new value for the chunksize in 1.1. However, be in mind
> that 256 has to be multiplied by the itemsize of each array, so the
> chunksize is currently 2048 bytes for 64-bit items (int64 or float64)
> and 4096 for double precision complex arrays, which are probably the
> sizes that have to be compared with VML.
So the optimum block size might depend on the type of expression and
if VML functions are used. On question: the block size is set by a
#define, is there a significantly poorer performance if you use a
variable instead? Would be more flexible, especially for testing and tuning.
>> I was missing the support for single precision floats.
> Yeah. This is because nobody has implemented it before, but it is
> completely doable.
More information about the Numpy-discussion