[SciPy-dev] fftw2 and fftw3

Matt Fishburn fishburn at MIT.EDU
Thu Dec 15 13:41:36 CST 2005

Arnd Baecker wrote:
> Hi,
> now also the results for the fftw2 vs. fftw3 comparison on the
> Itanium 2 are there.
> There is the same problem as on the other machines, for example:

I coded the original FFTW3 patches, but these were my first patches to 
OSS and I still don't completely understand scipy's build 
process/tradeoff philosophy.  If no one else wants to take a look at the 
FFTW stuff I guess you're stuck with me. :)

>> I am at complete loss of what is going on as I don't understand
>> fftpack at all.
>> Could an fft expert have a look or give some guidance on
>> how to explore this further?

In a nutshell, this is what I think is going on: FFTW plans out the 
transform before execution, pushing off as much of the execution as 
possible into the planning state.  FFTW3 plans have static memory 
locations for plans, whereas FFTW2 plans do not.  This allows FFTW2 
plans to be much more flexible than FFTW3 plans.  I believe the 
rationale for the plan/memory decision has to do with various speedups 
FFTW3 supports, such as SSE, but I'm not positive on this.

This planning difference requires FFTW3 to either cache the entire chunk 
of array memory, copying the array to the plan memory when the transform 
is operating on the array, or come up with a new plan each time a 
transform is performed.  There are tradeoffs between caching/copying 
memory and plan generation time - for small arrays, the FFTW3 code 
should probably cache the memory as copying memory is cheaper than 
coming up with a new plan, while for large arrays (or very large arrays) 
the overhead from copying the memory may be poor compared to the plan 
generation time.

Currently, I believe scipy caches the plan and memory for real 
transforms, but memory and plans for complex plans is not cached.  This 
may explain the poor performance of complex FFTs versus real FFTs for 
fftw3, but would not explain why fftw3 is slower than fftw2.  I think 
the fftw3 versus fftw2 issue stems from the planning difference.

> This clearly shows that
> fftw3 is faster than fftw2 (in particular for "i" - inplace situation)
> (and not at a all a factor of 10 slower!).

Keep in mind that any fftw3 vs fftw2 data you get off of the fftw 
website may dance around the planning difference.  mflop data may be 
only for the actual transform, not including plan generation, so it 
would be possible for only a single transform (not including plan 
generation or memory copying) using fftw3 to appear faster.

As for scipy, I don't remember what scipy caches for fftw3 versus fftw2. 
  My theories are guesses from what I remember about the code, which I 
haven't seen for a few months, but I'll try to take a look at it this 
weekend or next weekend.  I do have final exams for college, though, and 
may not be able to get to this until Christmas break on the 21st.

-Matt Fishburn
who should really be studying for his Game Theory final

More information about the Scipy-dev mailing list