[SciPy-dev] fftw2 and fftw3

Arnd Baecker arnd.baecker at web.de
Mon Dec 19 04:55:20 CST 2005


On Thu, 15 Dec 2005, Matt Fishburn wrote:

> Arnd Baecker wrote:
> > Hi,
> >
> > now also the results for the fftw2 vs. fftw3 comparison on the
> > Itanium 2 are there.
> >
> > There is the same problem as on the other machines, for example:
> >
> ...
>
> I coded the original FFTW3 patches, but these were my first patches to
> OSS and I still don't completely understand scipy's build
> process/tradeoff philosophy.  If no one else wants to take a look at the
> FFTW stuff I guess you're stuck with me. :)

Being "stuck" with the original coder is not the worst thing to
happen ;-)
I am sure that Pearu can help with the build aspects
(though I don't think that it is really a build issue).
Note, that with recent changes in svn  by Pearu
it is possible to build fftpack without installing all of full scipy.
This is very convenient for testing!

> >> I am at complete loss of what is going on as I don't understand
> >> fftpack at all.
> >> Could an fft expert have a look or give some guidance on
> >> how to explore this further?
>
> In a nutshell, this is what I think is going on: FFTW plans out the
> transform before execution, pushing off as much of the execution as
> possible into the planning state.  FFTW3 plans have static memory
> locations for plans, whereas FFTW2 plans do not.  This allows FFTW2
> plans to be much more flexible than FFTW3 plans.  I believe the
> rationale for the plan/memory decision has to do with various speedups
> FFTW3 supports, such as SSE, but I'm not positive on this.
>
> This planning difference requires FFTW3 to either cache the entire chunk
> of array memory, copying the array to the plan memory when the transform
> is operating on the array, or come up with a new plan each time a
> transform is performed.  There are tradeoffs between caching/copying
> memory and plan generation time - for small arrays, the FFTW3 code
> should probably cache the memory as copying memory is cheaper than
> coming up with a new plan, while for large arrays (or very large arrays)
> the overhead from copying the memory may be poor compared to the plan
> generation time.

I am not quite sure about this - it seems that planning
times can become very large for large arrays...

> Currently, I believe scipy caches the plan and memory for real
> transforms, but memory and plans for complex plans is not cached.  This
> may explain the poor performance of complex FFTs versus real FFTs for
> fftw3,

OK, that would be one important improvement.
I am not sure If you have read the mail
  http://www.scipy.org/mailinglists/mailman?fn=scipy-user/2005-December/006170.html
on scipy-user, where Darren Dale discusses similar problems,
and some more benchmarks on this ar given.
In particular, don't miss the pics ;-)
http://www.physik.tu-dresden.de/~baecker/tmp/fftw/test_AB.png
http://www.physik.tu-dresden.de/~baecker/tmp/fftw/test_DD.png

So I really think that the complex vs. real difference
is the most important thing.

> but would not explain why fftw3 is slower than fftw2.  I think
> the fftw3 versus fftw2 issue stems from the planning difference.

OK, that might be the next step to tackle ...

> > This clearly shows that
> > fftw3 is faster than fftw2 (in particular for "i" - inplace situation)
> > (and not at a all a factor of 10 slower!).
>
> Keep in mind that any fftw3 vs fftw2 data you get off of the fftw
> website

Well, these don't help too much anyway, because
the benchmarks are for different machines.
That's why one has to use benchfft - or, with fftw3, the
 `tests/bench` programm

> may dance around the planning difference.  mflop data may be
> only for the actual transform, not including plan generation, so it
> would be possible for only a single transform (not including plan
> generation or memory copying) using fftw3 to appear faster.

Some hints on this may be gotten by
  cd fftw-3.0.1/tests

  ./bench   -s icf8192
  Problem: icf8192, setup: 476.52 ms, time: 202.25 us, ``mflops'': 2632.8
   ./bench  -opatient -s icf8192
  Problem: icf8192, setup: 4.29 s, time: 198.00 us, ``mflops'': 2689.3
  ./bench  -oexhaustive -s icf8192
  Problem: icf8192, setup: 28.92 s, time: 197.00 us, ``mflops'': 2702.9

(different machine than for the plots at www.ph..., Itanium 2):

> As for scipy, I don't remember what scipy caches for fftw3 versus fftw2.
>   My theories are guesses from what I remember about the code, which I
> haven't seen for a few months, but I'll try to take a look at it this
> weekend or next weekend.

That would be fantastic!!

> I do have final exams for college, though, and
> may not be able to get to this until Christmas break on the 21st.

In the long run it would be nice if we could get support
for all the FFTs from ACML, IMKL, scsl, ...
Looking at the code for fftpack, this would mean to add many
more #if's to  src/fftpack.h and src/zfft.c  - not sure
if that will become unreadable ...
A collegue of mine just told me that the IMKL FFT has
an interface which is pretty similar to the fftw2 one,
so this might be a lower hanging fruit ...

Many thanks, looking forward to your findings!!

Best, Arnd




More information about the Scipy-dev mailing list