[SciPy-dev] FFTW performances in scipy and numpy
David Cournapeau
david@ar.media.kyoto-u.ac...
Wed Aug 1 02:49:34 CDT 2007
Anne Archibald wrote:
> On 01/08/07, David Cournapeau <david@ar.media.kyoto-u.ac.jp> wrote:
>
>> I am one of the contributor to numpy/scipy. Let me first say I am
>> *not* the main author of the fftw wrapping for scipy, and that I am a
>> relatively newcommer in scipy, and do not claim a deep understanding of
>> numpy arrays. But I have been thinking a bit on the problem since I am a
>> big user of fft and debugged some problems in the scipy code since.
>
Ok, I prepared a small package to test several strategies:
http://www.ar.media.kyoto-u.ac.jp/members/david/archives/fftdev.tbz2
By doing make test, it should build out of the box and run the tests (if
you are on Linux, have gcc and fftw3, of course :) ). I did not even
check whether the computation is OK (I just tested against memory
problems under valgrind).
3 strategies are available:
- Have a flag to check whether the given array is 16 bytes aligned,
and conditionnally build plans using this info
- Use FFTW_UNALIGNED, and do not care about alignement
- Current strategy: copy.
The three strategies use FFTW_MEASURE, which I didn't do before, and may
explain the wrong things I've said in my precedent email.
There are four binaries, two for the first strategy: in one case
(testsimd), I use standard malloc, and in the second case, I force
malloc to be 16 bytes aligned.
The results on my pentium 4, for size of 1024, and 5000 iterations:
========================================
16 bytes aligned malloc + FFTW_MEASURE
========================================
size is 1024, iter is 5000
testing cached (estimate)
cycle is 780359600.00, 156071.92 per execution, min is 56816.00
countaligned is 5000
========================================
not aligned malloc + FFTW_MEASURE
========================================
size is 1024, iter is 5000
testing cached (estimate)
cycle is 948300378.00, 189660.08 per execution, min is 88216.00
countaligned is 0
========================================
not aligned malloc + FFTW_MEASURE | FFTW_UNALIGNED
========================================
size is 1024, iter is 5000
testing cached (estimate)
cycle is 1110396876.00, 222079.38 per execution, min is 87788.00
countaligned is 0
========================================
not aligned malloc + FFTW_MEASURE + copy
========================================
size is 1024, iter is 5000
testing cached (estimate)
cycle is 1644872686.00, 328974.54 per execution, min is 219936.00
countaligned is 0
TESTING Done
min is what matters: it is the minum number of cycles within the 5000
iterations.
On my pentium m, I got totally different results of course:
51000 cycles
55000 cycles
55000 cycles
150000 cycles
And also, those are almost always reproducible (do not change much
between runs). I don't know if this is because of the Pentium 4 of my
workstation, or because I have the home on NFS which may screw things up
in a way I don't understand, but on my workstation, each run gives
different results.
Basically, we should change the current scipy implementation now, as the
third strategy is easy to implement, and is 3 times better than the
current one :)
>
> Not just libraries; with SSE2 and related instruction sets, it's quite
> possible that even ufuncs could be radically accelerated - it's
> reasonable to use SIMD (and cache control!) for even the simple case
> of adding two arrays into a third. No code yet exists in numpy to do
> so, but an aggressive optimizing compiler could do something with the
> code that is there. (Of course, this has observable numerical effects,
> so there would be the same problem as for gcc's -ffast-math flag.)
The problem of precision really is specific to SSE and x86, right ? But
since apple computers also use those now, I guess the problem is kind of
pervasive :)
>
> Really large numpy arrays are already going to be SIMD-aligned (on
> Linux at least), because they are allocated on fresh pages. Small
> arrays are going to waste space if they're SIMD-aligned. So the
> default allocator is probably fine as it is, but it would be handy to
> have alignment as an additional property one could request from
> constructors and check from anywhere. I would hesitate to make it a
> flag, since one might well care about page alignment, 32-bit
> alignment, or whatever.
Are you sure about the page thing ? A page is 4kb, right ? This would
mean any double numpy arrays above 512 items is aligned... which is not
what I observed when I tested. Since I screwed things up last time I
checked, I should test again, though.
>
> Really I suppose one can tell alignment pretty easily by looking at
> the pointer (does FFTW_ESTIMATE do this automatically?)
This is trivial: just check whether your pointer address is a multiple
of 16 (one line of code in my benchmark, in zfft_fftw3.c)
David
More information about the Scipy-dev
mailing list