[SciPy-dev] FFTW performances in scipy and numpy

David Cournapeau david@ar.media.kyoto-u.ac...
Wed Aug 1 00:13:43 CDT 2007

Hi Stevens,

    I am one of the contributors to numpy/scipy. Let me first say I am 
*not* the main author of the fftw wrapping for scipy, that I am a 
relative newcomer to scipy, and that I do not claim a deep understanding 
of numpy arrays. But I have been thinking a bit about the problem, since 
I am a big user of fft and have debugged some problems in the scipy code.
    - about copying killing performance: I am well aware of the 
problem. The copy was only a quick hack, because performance was abysmal 
before it (the plan was computed for every fft!), and I had some 
difficulty following the code well enough to do something better. At 
least it made the performance acceptable.
    - Because I found the code difficult to follow, I started 
cleaning up the sources. The real goal is to add a better mechanism to 
use fftw as efficiently as possible.
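
The point about the plan being computed for every fft can be illustrated 
with a small cache keyed on transform size and planner flags. This is 
only a sketch: `PlanCache` and `toy_planner` are hypothetical names, not 
the scipy wrapper code, and `toy_planner` merely stands in for an 
expensive planning call such as FFTW's `fftw_plan_dft_1d`.

```python
class PlanCache:
    """Reuse plans instead of recomputing them for every transform.

    Hypothetical sketch: the (n, flags) key is an assumption; a real
    wrapper may also need to key on direction, dtype or alignment.
    """

    def __init__(self, planner):
        self._planner = planner   # callable doing the expensive planning
        self._plans = {}          # (n, flags) -> plan
        self.misses = 0           # how many times planning actually ran

    def get(self, n, flags="MEASURE"):
        key = (n, flags)
        plan = self._plans.get(key)
        if plan is None:
            plan = self._planner(n, flags)
            self._plans[key] = plan
            self.misses += 1
        return plan


def toy_planner(n, flags):
    # Stand-in for the real (slow) planner; returns a dummy plan object.
    return {"n": n, "flags": flags}


cache = PlanCache(toy_planner)
for _ in range(1000):
    plan = cache.get(1024)        # planning happens on the first call only
```

With such a cache, a thousand transforms of the same size trigger a 
single planning step, which is what makes a costly planner like 
FFTW_MEASURE affordable.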

To improve performance, I thought about several approaches, which 
happen to be the ones you suggest :)

    - making numpy data 16-byte aligned. This one is a bit tricky. I 
won't bother you with the details, but generally, numpy data may not 
even be "word aligned". Since some archs require some kind of alignment, 
there are some mechanisms to get aligned buffers from unaligned buffers 
in the numpy API; I was thinking about an additional "SIMD alignment" 
flag, since this could be quite useful for many optimized libraries 
using SIMD. But maybe this does not make sense; I have not yet thought 
enough about it to propose anything concrete to the main numpy developers.
    - I have tried FFTW_UNALIGNED + FFTW_ESTIMATE plans; unfortunately, 
I found that the performance was worse than using FFTW_MEASURE + copy 
(the copies are done into aligned buffers). I have since discovered that 
this may be due to the totally broken architecture of my main 
workstation (a Pentium 4): on my recent macbook (Core 2 Duo, running 
32-bit Linux), using no copy with FFTW_UNALIGNED is much better.
    - The above problem is fixable if we add a mechanism to choose 
plans (ESTIMATE vs MEASURE vs ...; I found that for 1d cases at least, 
ESTIMATE vs MEASURE is what really counts performance-wise).
    - I have also tried keeping two plans in parallel for each size (one 
SIMD, one non-SIMD), but this does not work very well: numpy arrays are 
almost never 16-byte aligned, so it does not seem to be worth the 
effort.
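
The alignment question in the list above comes down to testing a buffer 
address modulo 16 and, when the test fails, falling back to an aligned 
scratch buffer or to an unaligned plan. Below is a minimal sketch using 
only `ctypes` from the standard library. The helper names (`is_aligned`, 
`aligned_doubles`, `choose_strategy`) are hypothetical, and the 
over-allocate-and-offset trick is just one common way to obtain an 
aligned buffer, not necessarily what numpy or scipy do. For a numpy 
array, the address to test would be `arr.ctypes.data`.

```python
import ctypes

SIMD_ALIGNMENT = 16  # bytes, the alignment discussed above


def is_aligned(address, alignment=SIMD_ALIGNMENT):
    """True if the raw address is a multiple of `alignment`."""
    return address % alignment == 0


def aligned_doubles(n, alignment=SIMD_ALIGNMENT):
    """Return a ctypes array of n doubles whose start is aligned.

    Over-allocates by `alignment` bytes, then skips just enough bytes
    at the front to reach the next aligned address.
    """
    nbytes = n * ctypes.sizeof(ctypes.c_double)
    raw = ctypes.create_string_buffer(nbytes + alignment)
    offset = (-ctypes.addressof(raw)) % alignment
    # from_buffer shares memory with `raw` and keeps it alive.
    return (ctypes.c_double * n).from_buffer(raw, offset)


def choose_strategy(address, alignment=SIMD_ALIGNMENT):
    """Hypothetical dispatch: which kind of plan could be used as-is."""
    if is_aligned(address, alignment):
        return "aligned"      # SIMD plan usable directly on the data
    return "unaligned"        # copy to an aligned buffer, or unaligned plan


buf = aligned_doubles(8)
buf[:] = [float(i) for i in range(8)]   # the "copy into aligned buffer" step
strategy = choose_strategy(ctypes.addressof(buf))
```

Whether the unaligned case should copy (as with FFTW_MEASURE + copy) or 
use an FFTW_UNALIGNED plan is exactly the machine-dependent choice 
described above, so the sketch leaves it open.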

If you are interested in more concrete results/code, I can take a look 
at my test programs, make them buildable, and make them available for 
your comments (the test programs do not depend on python; they are pure 
C code directly using the C wrapping used by scipy).


