[Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator
Sat Jan 17 03:50:15 CST 2009
On Sat, Jan 17, 2009 at 4:35 AM, Gregor Thalhammer wrote:
> Francesc Alted wrote:
>> On Friday 16 January 2009, Gregor Thalhammer wrote:
>>> I also gave a try to the vector math library (VML), contained in
>>> Intel's Math Kernel Library. This offers a fast implementation of
>>> mathematical functions, operating on arrays. First I implemented a C
>>> extension, providing new ufuncs. This gave me a big performance gain,
>>> e.g., 2.3x (5x) for sin, 6x (10x) for exp, 7x (15x) for pow, and 3x
>>> (6x) for division (no gain for add, sub, mul).
>> Wow, pretty nice speed-ups indeed! In fact I was thinking of including
>> support for threading in Numexpr (I don't think it would be too
>> difficult, but let's see). BTW, do you know how VML is able to achieve
>> a speedup of 6x for a sin() function? I suppose this is because they
>> are using SSE instructions, but, are these also available for 64-bit
>> double precision items?
> I am not an expert on SSE instructions, but to my knowledge there is
> (in the Core 2 architecture) no SSE instruction to calculate the sin.
> But it seems to be possible to (approximately) calculate a sin with a
> couple of multiplication/addition instructions (and these do exist in
> SSE for 64-bit floats). Intel (and AMD) seem to use a more clever
> algorithm, implemented more efficiently than the standard one.
Generally, transcendental functions are not sped up by being
implemented in hardware. There is no special algorithm: you implement
them as you would in C, using Taylor expansions or other known
polynomial expansions, except you use SIMD to evaluate those
polynomials. You can also use table lookup, which can be pretty fast
while still getting full precision for trigonometric functions.
musicdsp.org has some of those (take care: a lot of those tricks do
not give full precision - they are used for music synthesis, where
full precision is rarely needed and speed is of utmost importance).
There were some full-precision examples on freescale.com, but I can't
find them anymore.
For some functions, you can get almost an order of magnitude faster
transcendental functions (at full precision), but it is a lot of work
to make sure they work as expected in a cross-platform way (even when
limiting yourself to one CPU architecture and using asm, there are
differences between compilers which make this rather difficult).