[SciPy-dev] FFTW performances in scipy and numpy
Wed Aug 1 02:04:03 CDT 2007
On 01/08/07, David Cournapeau <firstname.lastname@example.org> wrote:
> I am one of the contributor to numpy/scipy. Let me first say I am
> *not* the main author of the fftw wrapping for scipy, and that I am a
> relatively newcommer in scipy, and do not claim a deep understanding of
> numpy arrays. But I have been thinking a bit on the problem since I am a
> big user of fft and debugged some problems in the scipy code since.
I have been using FFTW from raw c, and thinking about how it could be
used efficiently from python. (I also do oodles of FFTs in numpy, but
they are not performance-critical.)
> - about copying killing performances: I am well aware of the
> problem, this was only a quick hack because the performances were abysal
> before this hack (the plan was computed for every fft !), and I had some
> difficulties to follow the code for something better. At least, it made
> the performances acceptable.
How much faster are plans likely to get with FFTW_DESTROY? Is
copy+FFTW_DESTROY still much slower than FFTW_PRESERVE?
Multidimensional real ffts may require copying anyway, since they
don't support FFTW_PRESERVE (and the numpy fft interface doesn't
damage its input).
> - Because I found the code difficult to follow code, I started
> cleaning up the sources. The real goal is to add a better mechanism to
> use fftw as efficiently as possible.
It might be worth providing a thinner wrapper (as well as the current
simple one) so that one could control FFTW's planning from python
code. Perhaps a Plan object, which keeps track of (and optionally
allocates, with optimal alignment and striding) its input and output
> - making numpy data 16 bytes aligned. This one is a bit tricky. I
> won't bother you with the details, but generally, numpy data may not be
> even "word aligned". Since some archs require some kind of alignement,
> there are some mechanisms to get aligned buffers from unaligned buffers
> in numpy API; I was thinking about an additional flag "SIMD alignement",
> since this could be quite useful for many optimized libraries using
> SIMD. But maybe this does not make sense, I have not yet thought enough
> about it to propose anything concrete to the main numpy developers.
Not just libraries; with SSE2 and related instruction sets, it's quite
possible that even ufuncs could be radically accelerated - it's
reasonable to use SIMD (and cache control!) for even the simple case
of adding two arrays into a third. No code yet exists in numpy to do
so, but an aggressive optimizing compiler could do something with the
code that is there. (Of course, this has observable numerical effects,
so there would be the same problem as for gcc's -ffast-math flag.)
Really large numpy arrays are already going to be SIMD-aligned (on
Linux at least), because they are allocated on fresh pages. Small
arrays are going to waste space if they're SIMD-aligned. So the
default allocator is probably fine as it is, but it would be handy to
have alignment as an additional property one could request from
constructors and check from anywhere. I would hesitate to make it a
flag, since one might well care about page alignment, 32-bit
alignment, or whatever.
Really I suppose one can tell alignment pretty easily by looking at
the pointer (does FFTW_ESTIMATE do this automatically?) and IIRC one
can produce aligned arrays at the python level, if one is determined
to, so even without changes to the numpy C API this could be kluged
> - I have tried FFTW_UNALIGNED + FFTW_ESTIMATE plans; unfortunately,
> I found that the performances were worse than using FFTW_MEASURE + copy
> (the copies are done into aligned buffers). I have since discover that
> this may be due to the totally broken architecture of my main
> workstation (a Pentium four): on my recent macbook (On linux, 32 bits,
> CoreDuo2), using no copy with FFTW_UNALIGNED is much better.
> - The above problem is fixable if we add a mechanisme to choose
> plans (ESTIMATE vs MEASURE vs ... I found that for 1d cases at least,
> ESTIMATE vs MEASURE is what really count performance wise).
> - I have also tried to get two plans in parallel for each size (one
> SIMD, one not SIMD), but this does not work very well, because numpy
> arrays are almost never 16 bytes aligned, so this does not seem to worth
> the effort.
How much reuse of wisdom is there? Building a good wisdom database
during installation might make planning every time reasonable. What
properties of an array does the planner care about if used in guru
Anne M. Archibald
More information about the Scipy-dev