[Numpy-discussion] Proposed Roadmap Overview
Dag Sverre Seljebotn
Mon Feb 20 12:08:50 CST 2012
On 02/20/2012 09:34 AM, Christopher Jordan-Squire wrote:
> On Mon, Feb 20, 2012 at 9:18 AM, Dag Sverre Seljebotn
> <firstname.lastname@example.org> wrote:
>> On 02/20/2012 08:55 AM, Sturla Molden wrote:
>>> On 20.02.2012 17:42, Sturla Molden wrote:
>>>> There are still other options than C or C++ that are worth considering.
>>>> One would be to write NumPy in Python. E.g. we could use LLVM as a
>>>> JIT-compiler and produce the performance critical code we need on the fly.
>>> LLVM and its C/C++ frontend Clang are BSD licenced. It compiles faster
>>> than GCC and often produces better machine code. They can therefore be
>>> used inside an array library. It would give a faster NumPy, and we could
>>> keep most of it in Python.
>> I think it is moot to focus on improving NumPy performance as long as in
>> practice all NumPy operations are memory bound due to the need to take a
>> trip through system memory for almost any operation. C/C++ is simply
>> "good enough". JIT is when you're chasing a 2x improvement or so, but
>> today NumPy can be 10-20x slower than a Cython loop.
> I don't follow this. Could you expand a bit more? (Specifically, I
> wasn't aware that numpy could be 10-20x slower than a cython loop, if
> we're talking about the base numpy library--so core operations. I'm
The problem with NumPy is the temporaries it needs. If you want to compute
    A + B + np.sqrt(D)
and the arrays are larger than the cache size (a couple of megabytes),
then each of those operations transfers its data in and out over the
memory bus. I.e. first you compute an element of sqrt(D), then the
result is written to system memory, and later the same number is read
back in order to add it to an element in B, and so on.
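To make the temporaries concrete, here is a minimal sketch of what the
one-line expression amounts to when written out step by step. The names
t1 and t2 and the array size are illustrative assumptions, not anything
taken from the NumPy sources.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000  # 8 MB per float64 array, larger than a typical CPU cache
A, B, D = rng.random(n), rng.random(n), rng.random(n)

# The one-liner A + B + np.sqrt(D), spelled out. Each step is a full pass
# over memory that allocates and writes a whole array.
t1 = A + B         # reads A and B, writes the temporary t1
t2 = np.sqrt(D)    # reads D, writes the temporary t2
result = t1 + t2   # reads t1 and t2 back in, writes the result

# Same operations in the same order, so the values match exactly.
assert np.array_equal(result, A + B + np.sqrt(D))
```

Every intermediate value makes a round trip through system memory before
it is used again, which is the traffic described above.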
The compute-to-bandwidth ratio of modern CPUs is between 30:1 and
60:1... so in extreme cases it's cheaper to do 60 additions than to
transfer a single number from system memory.
It is much faster to transfer only an element (or small block) from each
of A, B, and D to CPU cache, then evaluate the entire expression, then
transfer the result back. This is easy to code in Cython/Fortran/C and
impossible with plain NumPy expressions in Python.
This is why numexpr and Theano exist.
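The single-pass evaluation described above has roughly the following
loop structure. This sketch is written in plain Python only to show the
shape of the loop; the function name fused is hypothetical, and in pure
Python the loop is slow. It only pays off once compiled (e.g. as a
typed Cython, Fortran or C loop).

```python
import math
import numpy as np

def fused(A, B, D):
    """Evaluate A + B + sqrt(D) in one pass over 1-D arrays.

    Each element of A, B and D is read once, the whole expression is
    evaluated, and one result is written. No temporary arrays are made.
    """
    out = np.empty_like(A)
    for i in range(A.shape[0]):
        out[i] = A[i] + B[i] + math.sqrt(D[i])
    return out

A = np.array([1.0, 2.0, 3.0])
B = np.array([0.5, 0.5, 0.5])
D = np.array([4.0, 9.0, 16.0])
print(fused(A, B, D))  # [3.5 5.5 7.5]
```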
You can make the slowdown over Cython/Fortran/C almost arbitrarily large
by adding terms to the expression above. So of course, the actual slowdown
depends on your use case.
> also not totally sure why a JIT is a 2x improvement or so vs. cython.
> Not that I disagree on either of these points, I'd just like a bit
> more detail.)
I meant that the JIT may be a 2x improvement over the current NumPy C
code. There's some logic when iterating arrays that could perhaps be
specialized away depending on the actual array layout at runtime.
But I'm thinking that probably a JIT wouldn't help all that much, so
it's probably 1x -- the 2x was just to be very conservative w.r.t. the
argument I was making, as I don't know the NumPy C sources well enough.
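As an illustration of what "array layout at runtime" refers to, the
sketch below inspects the flags and strides of a few views of the same
data. The dispatcher pick_loop is purely hypothetical and is not how the
NumPy C iteration code is organized; it only shows the kind of decision
a specializing compiler could make once the layout is known.

```python
import numpy as np

a = np.arange(12, dtype=np.float64).reshape(3, 4)
views = {"a": a, "a.T": a.T, "a[:, ::2]": a[:, ::2]}

for name, v in views.items():
    # Contiguity and strides are properties of each array object, known
    # only at runtime, not of the code that loops over it.
    print(name, v.flags.c_contiguous, v.flags.f_contiguous, v.strides)

def pick_loop(arr):
    """Hypothetical dispatcher: choose a loop variant from the layout."""
    if arr.flags.c_contiguous or arr.flags.f_contiguous:
        return "flat"     # one dense block, a simple 1-D loop suffices
    return "strided"      # general case, must follow per-axis strides
```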