[Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator
Fri Jan 16 10:48:34 CST 2009
On Friday 16 January 2009, Gregor Thalhammer wrote:
> I also gave a try to the vector math library (VML), contained in
> Intel's Math Kernel Library. This offers a fast implementation of
> mathematical functions, operating on arrays. First I implemented a C
> extension, providing new ufuncs. This gave me a big performance gain,
> e.g., 2.3x (5x) for sin, 6x (10x) for exp, 7x (15x) for pow, and 3x
> (6x) for division (no gain for add, sub, mul).
Wow, pretty nice speed-ups indeed! In fact, I was thinking of including
support for threading in Numexpr (I don't think it would be too
difficult, but let's see). BTW, do you know how VML is able to achieve
a 6x speedup for the sin() function? I suppose this is because they
are using SSE instructions, but are those also available for 64-bit
double-precision items?
> The values in
> parentheses are given if I allow VML to use several threads and to
> employ both cores of my Intel Core2Duo computer. For large arrays
> (100M entries) this performance gain is reduced because of limited
> memory bandwidth. At this point I was stumbling across numexpr and
> modified it to use the VML functions. For sufficiently long and
> complex numerical expressions I could get the maximum performance
> also for large arrays.
> Together with VML, numexpr seems to be an
> extremely powerful tool for getting optimum performance. I would like
> to see numexpr extended to (optionally) make use of fast vectorized
> math functions.
Well, if you can provide the code, I'd be glad to include it in numexpr.
The only requirement is that VML support must be optional during the
build of the package.
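For illustration, here is a minimal sketch of how an optional VML
dependency could be probed at build time. The helper name detect_vml
and the library names are assumptions for this sketch, not numexpr's
actual build code:

    from ctypes.util import find_library

    def detect_vml():
        """Return True if an MKL/VML shared library can be located.

        Hypothetical helper: numexpr's real setup machinery may probe
        differently; this only sketches the optional-dependency idea.
        """
        for name in ("mkl_rt", "mkl_vml_avx", "mkl"):
            if find_library(name) is not None:
                return True
        return False

    # The build proceeds either way; VML-backed opcodes would only be
    # compiled in when detect_vml() reports True.
    HAVE_VML = detect_vml()
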
> There is one but: VML supports (at the moment) only math
> on contiguous arrays. At a first try I didn't understand how to
> enforce this limitation in numexpr.
No problem. At the end of numexpr/necompiler.py you will see some
lines like:

    # All the opcodes can deal with strided arrays directly as
    # long as they are unidimensional (strides in other
    # dimensions are dealt with in the extension), so we don't
    # need a copy for the strided case.
    if not b.flags.aligned:

which you can replace with something like:

    # need a copy for the strided case.
    if VML_available and not b.flags.contiguous:
        b = b.copy()
    elif not b.flags.aligned:
That would be enough for ensuring that all the arrays are contiguous
when they hit numexpr's virtual machine.
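To see what that check buys you, here is a small, self-contained
illustration of the same copy-on-non-contiguous logic, using plain
NumPy outside of numexpr:

    import numpy as np

    a = np.arange(10, dtype=np.float64)
    b = a[::2]                      # strided view: not contiguous

    # Mirror the proposed patch: copy strided operands so a
    # contiguous-only backend (like VML) can process them.
    if not b.flags.contiguous:
        b = b.copy()

    assert b.flags.contiguous and b.flags.aligned
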
That being said, it is a shame that VML does not have support for
strided/unaligned arrays. They are quite common beasts, especially when
you work with heterogeneous arrays (aka record arrays).
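A quick example of why such beasts are common: a single field of a
record array is a strided view, with the size of the whole record as
its stride (the field names here are just for illustration):

    import numpy as np

    # Records of an int8 followed by a float64 (packed layout).
    ra = np.zeros(4, dtype=[('x', np.int8), ('y', np.float64)])
    y = ra['y']            # view of one field, 9-byte stride
    print(y.strides)       # (9,) -- the full record size
    assert not y.flags.contiguous
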
> I also gave a quick try to the
> equivalent vector math library, acml_mv from AMD. I only tried sin
> and log; they gave me the same performance (on an Intel processor!)
> as Intel's VML.
> I was also playing around with the block size in numexpr. What is
> the rationale that led to the current block size of 128? Especially
> with VML, a larger block size of 4096 instead of 128 made it
> possible to use multithreading in VML efficiently.
Experimentation. Back in 2006, David found that 128 was optimal for
the processors available at that time. With Numexpr 1.1, my
experiments show that 256 is a better value for current Core2
processors and most expressions in our benchmark bed (see the
benchmarks/ directory); hence, 256 is the new value for the chunksize
in 1.1. However, bear in mind that 256 has to be multiplied by the
itemsize of each array, so the chunksize is currently 2048 bytes for
64-bit items (int64 or float64) and 4096 bytes for double-precision
complex arrays, which are probably the sizes that should be compared
with VML's.
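The arithmetic above can be checked directly; BLOCK here just names
the 256-element chunk length mentioned above:

    import numpy as np

    BLOCK = 256  # elements per chunk in Numexpr 1.1

    # chunk size in bytes = element count x itemsize
    for dt in (np.int64, np.float64, np.complex128):
        print(np.dtype(dt).name, BLOCK * np.dtype(dt).itemsize)
    # int64 and float64 give 2048 bytes; complex128 gives 4096.
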
> > Share your experience
> > =====================
> > Let us know of any bugs, suggestions, gripes, kudos, etc. you may
> > have.
> I was missing the support for single precision floats.
Yeah. This is only because nobody has implemented it before.
> Great work!
You are welcome! And thanks for the excellent feedback too! I hope we
can have a VML-aware numexpr some time soon ;-)