[Numpy-discussion] [OT] Starving CPUs article featured in IEEE's ComputingNow portal

Anne Archibald peridot.faceted@gmail....
Fri Mar 19 12:13:33 CDT 2010


On 18 March 2010 13:53, Francesc Alted <faltet@pytables.org> wrote:
> On Thursday 18 March 2010 16:26:09 Anne Archibald wrote:
>> Speak for your own CPUs :).
>>
>> But seriously, congratulations on the wide publication of the article;
>> it's an important issue we often don't think enough about. I'm just a
>> little snarky because this exact issue came up for us recently - a
>> visiting astro speaker put it as "flops are free" - and so I did some
>> tests and found that even without optimizing for memory access, our
>> tasks are already CPU-bound:
>> http://lighthouseinthesky.blogspot.com/2010/03/flops.html
>
> Well, I thought that my introduction was enough to convince anybody about the
> problem, but forgot that you, the scientists, always try to demonstrate things
> experimentally :-/

Snrk. Well, technically, that is our job description...

> Seriously, your example is a clear illustration of what I'm recommending in
> the article, i.e. always try to use libraries that already leverage the
> blocking technique (that is, taking advantage of both temporal and spatial
> locality).  I don't know about FFTW (never used it, sorry), but after having
> a look at its home page, I'm pretty convinced that its authors are well
> aware of these techniques.

> That said, it seems that, in addition, you are also applying the blocking
> technique yourself: get the data in bunches (256 floating point elements,
> which fit perfectly well in modern L1 caches), apply your computation (in
> this case, FFTW) and put the result back in memory.  A perfect example of what
> I wanted to show to the readers so, congratulations! You did it without
> needing to read my article (so perhaps the article was not so necessary after
> all :-)
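
For illustration, here is a minimal sketch of the block-at-a-time pattern
described above: take 256 elements, transform them, store the result. It
is not the code from the blog post. It assumes NumPy's numpy.fft.rfft as a
stand-in for FFTW, and the function name blockwise_spectra and the test
signal are made up for the example.

```python
import numpy as np

BLOCK = 256  # block length mentioned in the thread; 256 doubles are 2 KiB


def blockwise_spectra(data, block=BLOCK):
    """Return the real FFT of each consecutive `block`-sized piece of `data`.

    Hypothetical helper: numpy.fft.rfft stands in for FFTW here.
    Any trailing partial block is ignored.
    """
    n_blocks = len(data) // block
    out = np.empty((n_blocks, block // 2 + 1), dtype=np.complex128)
    for i in range(n_blocks):
        chunk = data[i * block:(i + 1) * block]  # a view, not a copy
        out[i] = np.fft.rfft(chunk)              # work on one small block
    return out


rng = np.random.default_rng(0)
signal = rng.standard_normal(16 * BLOCK)  # made-up input data
spectra = blockwise_spectra(signal)
```

Each pass through the loop touches only one small block of input and one
row of output, which is the property the blocking technique relies on.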

What I didn't go into in detail in the article was that there's a
trade-off of processing versus memory access available: we could
reduce the memory load by a factor of eight by doing interpolation on
the fly instead of all at once in a giant FFT. But that would cost
cache space and flops, and we're not memory-dominated.

One thing I didn't try, and should: running four of these jobs at once
on a four-core machine. If I correctly understand the architecture,
that won't affect the cache issues, but it will effectively quadruple
the memory bandwidth needed, without increasing the memory bandwidth
available. (Which, honestly, makes me wonder what the point is of
building multicore machines.)

Maybe I should look into that interpolation stuff.

>> Heh. Indeed numexpr is a good tool for this sort of thing; it's an
>> unfortunate fact that simple use of numpy tends to do operations in
>> the pessimal order...
>
> Well, to be fair, NumPy does not have control over the order of the
> operations in expressions and how temporaries are managed: it is Python that
> decides that.  NumPy can only do what Python wants it to do, and do it as well
> as possible.  And NumPy plays its role reasonably well here, but of course,
> this is not enough to provide performance.  In fact, this problem probably
> affects all interpreted languages out there, unless they implement a JIT
> compiler optimised for evaluating expressions --and this is basically what
> numexpr is.
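
To make the point about temporaries concrete, here is a rough sketch
comparing the plain NumPy expression a*b + c with a hand-blocked version
that reuses one small scratch buffer. This is only an illustration of the
idea, not how numexpr is actually implemented; the function names and the
block size of 4096 are assumptions, and the blocked version assumes 1-D
arrays of the same length and dtype.

```python
import numpy as np


def eval_plain(a, b, c):
    # Python evaluates a * b first, producing a full-size temporary array,
    # and only then adds c to it.
    return a * b + c


def eval_blocked(a, b, c, block=4096):
    """Compute a*b + c one block at a time (hypothetical helper)."""
    out = np.empty_like(a)
    tmp = np.empty(block, dtype=a.dtype)  # small scratch buffer, reused
    for start in range(0, a.size, block):
        stop = min(start + block, a.size)
        t = tmp[:stop - start]  # view, so the last block may be shorter
        np.multiply(a[start:stop], b[start:stop], out=t)
        np.add(t, c[start:stop], out=out[start:stop])
    return out


rng = np.random.default_rng(1)
a, b, c = (rng.standard_normal(10_000) for _ in range(3))
plain = eval_plain(a, b, c)
blocked = eval_blocked(a, b, c)
```

The blocked version never allocates a full-size intermediate; the scratch
buffer is small enough to stay in cache between the multiply and the add.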

I'm not knocking numpy; it does (almost) the best it can. (I'm not
sure of the optimality of the order in which ufuncs are executed; I
think some optimizations there are possible.) But a language designed
from scratch for vector calculations could certainly compile
expressions into a form that would save a lot of memory accesses,
particularly if an optimizer combined many lines of code. I've
actually thought about whether such a thing could be done in Python; I
think the way to do it would be to build expression objects from
variable objects, then have a single "apply" function that feeds values
into all the variables. The same framework would support automatic
differentiation and other cool things, but I'm not sure it would be
useful enough to be worth the implementation complexity.
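
A toy sketch of that idea, under heavy assumptions: the class names (Expr,
Var, Const, Add, Mul) and the diff method are invented for illustration,
only + and * are supported, and nothing is optimized. It just shows
expression objects built from variable objects, a single apply function
that feeds values in, and differentiation falling out of the same tree.

```python
class Expr:
    """Base class: operators build a tree instead of computing a value."""
    def __add__(self, other):
        return Add(self, wrap(other))

    def __radd__(self, other):
        return Add(wrap(other), self)

    def __mul__(self, other):
        return Mul(self, wrap(other))

    def __rmul__(self, other):
        return Mul(wrap(other), self)


class Var(Expr):
    def __init__(self, name):
        self.name = name

    def eval(self, env):
        return env[self.name]

    def diff(self, var):
        return Const(1.0 if self is var else 0.0)


class Const(Expr):
    def __init__(self, value):
        self.value = value

    def eval(self, env):
        return self.value

    def diff(self, var):
        return Const(0.0)


class Add(Expr):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def eval(self, env):
        return self.left.eval(env) + self.right.eval(env)

    def diff(self, var):  # sum rule
        return Add(self.left.diff(var), self.right.diff(var))


class Mul(Expr):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def eval(self, env):
        return self.left.eval(env) * self.right.eval(env)

    def diff(self, var):  # product rule
        return Add(Mul(self.left.diff(var), self.right),
                   Mul(self.left, self.right.diff(var)))


def wrap(value):
    """Turn plain numbers into Const nodes; leave Expr nodes alone."""
    return value if isinstance(value, Expr) else Const(value)


def apply(expr, **values):
    """Feed values in to all the variables and evaluate the whole tree."""
    return expr.eval(values)


x, y = Var("x"), Var("y")
f = x * y + 3 * x                 # nothing is computed yet
value = apply(f, x=2.0, y=5.0)
dfdx = apply(f.diff(x), x=2.0, y=5.0)
dfdy = apply(f.diff(y), x=2.0, y=5.0)
```

Because the whole expression is available as a tree before any values
arrive, an optimizer would have somewhere to work; this sketch does none of
that and simply walks the tree.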

Anne

