[SciPy-dev] FFTW performances in scipy and numpy
Wed Aug 1 00:13:43 CDT 2007
I am one of the contributors to numpy/scipy. Let me first say I am
*not* the main author of the fftw wrapping for scipy, that I am a
relative newcomer to scipy, and that I do not claim a deep understanding
of numpy arrays. But I have been thinking a bit about the problem, since
I am a heavy user of fft and have since debugged some problems in the
scipy code.
    - About copying killing performance: I am well aware of the
problem; this was only a quick hack, because performance was abysmal
before it (a plan was recomputed for every fft!), and I had some
difficulty following the code well enough to do something better. At
least it made performance acceptable.
    - Because I found the code difficult to follow, I started
cleaning up the sources. The real goal is to add a better mechanism to
use fftw as efficiently as possible.
To improve performance, I thought about several approaches, which
happen to be the ones you suggest :)
    - Making numpy data 16-byte aligned. This one is a bit tricky. I
won't bother you with the details, but in general, numpy data may not
even be word aligned. Since some archs require some kind of alignment,
the numpy API has mechanisms to get aligned buffers from unaligned
ones; I was thinking about an additional "SIMD aligned" flag, since
this could be quite useful for many optimized libraries using SIMD.
But maybe this does not make sense; I have not yet thought about it
enough to propose anything concrete to the main numpy developers.
    - I have tried FFTW_UNALIGNED + FFTW_ESTIMATE plans; unfortunately,
I found that performance was worse than with FFTW_MEASURE + copy
(the copies are done into aligned buffers). I have since discovered that
this may be due to the totally broken architecture of my main
workstation (a Pentium 4): on my recent MacBook (on Linux, 32 bits,
Core 2 Duo), skipping the copy with FFTW_UNALIGNED is much better.
    - The above problem is fixable if we add a mechanism to choose
plans (ESTIMATE vs MEASURE vs ...). I found that, for 1d cases at
least, ESTIMATE vs MEASURE is what really counts performance-wise.
    - I have also tried keeping two plans per size (one SIMD, one
non-SIMD), but this does not work very well: numpy arrays are almost
never 16-byte aligned, so the SIMD plan is almost never used and this
does not seem worth the effort.
If you are interested in more concrete results/code, I can take a look
at my test programs, make them buildable, and make them available for
your comments (the test programs do not depend on python; they are pure
C code directly using the C wrapping used by scipy).