[SciPy-dev] fftw2 and fftw3
Matt Fishburn
fishburn at MIT.EDU
Thu Dec 15 13:41:36 CST 2005
Arnd Baecker wrote:
> Hi,
>
> now also the results for the fftw2 vs. fftw3 comparison on the
> Itanium 2 are there.
>
> There is the same problem as on the other machines, for example:
>
...
I coded the original FFTW3 patches, but these were my first patches to
OSS and I still don't completely understand scipy's build
process/tradeoff philosophy. If no one else wants to take a look at the
FFTW stuff I guess you're stuck with me. :)
>> I am at complete loss of what is going on as I don't understand
>> fftpack at all.
>> Could an fft expert have a look or give some guidance on
>> how to explore this further?
In a nutshell, this is what I think is going on: FFTW plans out the
transform before execution, pushing off as much of the execution as
possible into the planning state. FFTW3 plans have static memory
locations for plans, whereas FFTW2 plans do not. This allows FFTW2
plans to be much more flexible than FFTW3 plans. I believe the
rationale for the plan/memory decision has to do with various speedups
FFTW3 supports, such as SSE, but I'm not positive on this.
This planning difference requires FFTW3 to either cache the entire chunk
of array memory, copying the array to the plan memory when the transform
is operating on the array, or come up with a new plan each time a
transform is performed. There are tradeoffs between caching/copying
memory and plan generation time - for small arrays, the FFTW3 code
should probably cache the memory as copying memory is cheaper than
coming up with a new plan, while for large arrays (or very large arrays)
the overhead from copying the memory may be poor compared to the plan
generation time.
Currently, I believe scipy caches the plan and memory for real
transforms, but memory and plans for complex plans is not cached. This
may explain the poor performance of complex FFTs versus real FFTs for
fftw3, but would not explain why fftw3 is slower than fftw2. I think
the fftw3 versus fftw2 issue stems from the planning difference.
> This clearly shows that
> fftw3 is faster than fftw2 (in particular for "i" - inplace situation)
> (and not at a all a factor of 10 slower!).
Keep in mind that any fftw3 vs fftw2 data you get off of the fftw
website may dance around the planning difference. mflop data may be
only for the actual transform, not including plan generation, so it
would be possible for only a single transform (not including plan
generation or memory copying) using fftw3 to appear faster.
As for scipy, I don't remember what scipy caches for fftw3 versus fftw2.
My theories are guesses from what I remember about the code, which I
haven't seen for a few months, but I'll try to take a look at it this
weekend or next weekend. I do have final exams for college, though, and
may not be able to get to this until Christmas break on the 21st.
-Matt Fishburn
who should really be studying for his Game Theory final
More information about the Scipy-dev
mailing list