[SciPy-dev] FFTW performances in scipy and numpy
Wed Aug 1 10:42:57 CDT 2007
On 01/08/07, David Cournapeau <email@example.com> wrote:
> Anne Archibald wrote:
> > Not just libraries; with SSE2 and related instruction sets, it's quite
> > possible that even ufuncs could be radically accelerated - it's
> > reasonable to use SIMD (and cache control!) for even the simple case
> > of adding two arrays into a third. No code yet exists in numpy to do
> > so, but an aggressive optimizing compiler could do something with the
> > code that is there. (Of course, this has observable numerical effects,
> > so there would be the same problem as for gcc's -ffast-math flag.)
> The problem of precision really is specific to SSE and x86, right? But
> since Apple computers also use those now, I guess the problem is kind of
> pervasive :)
I think some other architectures (MIPS? not sure) may also use an
intermediate representation with more accuracy. As you say, though,
x86 and x86-64 are fairly pervasive.
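For a concrete feel of "observable numerical effects": one of the things
gcc's -ffast-math allows is regrouping floating-point additions, which is
also what a SIMD loop does when it keeps one partial sum per lane. The
sketch below uses plain Python doubles, not numpy; the helper names
sequential_sum and lane_sum are made up for illustration, and the four
"lanes" are only an assumption standing in for a vector register.

```python
import math

# Floating-point addition is not associative: regrouping changes the result.
left_to_right = (0.1 + 0.2) + 0.3   # 0.6000000000000001
regrouped = 0.1 + (0.2 + 0.3)       # 0.6

def sequential_sum(values):
    """Strict left-to-right accumulation, as a scalar loop would do it."""
    total = 0.0
    for v in values:
        total += v
    return total

def lane_sum(values, lanes=4):
    """Accumulate one partial sum per lane, then combine the lanes.

    Hypothetical stand-in for a vectorized reduction; same inputs,
    different grouping of the additions.
    """
    partial = [0.0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v
    total = 0.0
    for p in partial:
        total += p
    return total

values = [0.1] * 10
scalar_result = sequential_sum(values)   # rounding error accumulates
vector_like_result = lane_sum(values)    # may differ in the last bits
reference = math.fsum(values)            # correctly rounded sum
```

None of the groupings is "wrong"; they just round at different points,
which is why results can change when a compiler is allowed to reorder.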
BLAS would of course probably be faster (though how well does it cope
with peculiarly-strided data?) but I expect resistance to making numpy
depend on BLAS.
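On the stride question, for reference: level-1 BLAS routines such as
daxpy take one constant increment per vector (incx, incy). A regularly
strided 1-D view therefore maps onto a single call, while a general
multi-dimensional stride pattern would presumably need a loop over 1-D
slices. Below is a pure-Python sketch of those semantics for positive
increments only. The name axpy_strided is invented here; this is not a
BLAS binding, just an illustration of what the increments mean.

```python
def axpy_strided(n, alpha, x, incx, y, incy):
    """Reference semantics of a BLAS-style axpy on strided data.

    Computes y[i*incy] += alpha * x[i*incx] for i in range(n).
    x and y are flat Python sequences standing in for raw buffers.
    Only positive increments are handled in this sketch.
    """
    if incx <= 0 or incy <= 0:
        raise ValueError("this sketch only supports positive increments")
    for i in range(n):
        y[i * incy] += alpha * x[i * incx]
    return y

# Every second element of x is scaled and added into every third element of y.
x = [1.0, 99.0, 2.0, 99.0, 3.0, 99.0]
y = [10.0, 0.0, 0.0, 20.0, 0.0, 0.0, 30.0]
axpy_strided(3, 2.0, x, 2, y, 3)
# y is now [12.0, 0.0, 0.0, 24.0, 0.0, 0.0, 36.0]
```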
> > Really large numpy arrays are already going to be SIMD-aligned (on
> > Linux at least), because they are allocated on fresh pages. Small
> > arrays are going to waste space if they're SIMD-aligned. So the
> > default allocator is probably fine as it is, but it would be handy to
> > have alignment as an additional property one could request from
> > constructors and check from anywhere. I would hesitate to make it a
> > flag, since one might well care about page alignment, 32-bit
> > alignment, or whatever.
> Are you sure about the page thing? A page is 4 kB, right? This would
> mean any numpy array of more than 512 doubles is aligned... which is not
> what I observed when I tested. Since I screwed things up last time I
> checked, I should test again, though.
By "really large" I don't necessarily mean "larger than a page"; I
don't know what malloc's threshold is. I had in mind the 300-MB arrays
I'm allocating, which are definitely on fresh pages (which allows
malloc to dump them back to the OS when they get freed).
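Checking alignment is cheap once a buffer's address is known. The sketch
below uses only ctypes and mmap; the helper names is_aligned and
buffer_address are invented here, and for a numpy array the address
would come from arr.ctypes.data instead. It only relies on what is
guaranteed: an anonymous mmap starts on a page boundary. What malloc
returns for a given size is measured rather than assumed, since it
depends on the allocator (glibc, for one, documents a default mmap
threshold of 128 kB, tunable through mallopt's M_MMAP_THRESHOLD).

```python
import ctypes
import mmap

def is_aligned(address, boundary):
    """True if address is a multiple of boundary (both in bytes)."""
    return address % boundary == 0

def buffer_address(buf):
    """Address of the first byte of a writable buffer object."""
    return ctypes.addressof(ctypes.c_char.from_buffer(buf))

# The arithmetic from the thread: a 4096-byte page holds 4096 / 8 doubles.
doubles_per_page = 4096 // ctypes.sizeof(ctypes.c_double)   # 512

# Anonymous mapping: fresh pages from the OS, hence page-aligned.
fresh = mmap.mmap(-1, 4 * mmap.PAGESIZE)
fresh_addr = buffer_address(fresh)
page_aligned = is_aligned(fresh_addr, mmap.PAGESIZE)
simd_aligned = is_aligned(fresh_addr, 16)   # 16 bytes = one SSE register

# Heap allocation: alignment is allocator-dependent, so just measure it.
heap = ctypes.create_string_buffer(1 << 20)
heap_addr = ctypes.addressof(heap)
heap_offset_in_page = heap_addr % mmap.PAGESIZE
heap_simd_aligned = is_aligned(heap_addr, 16)
```

Printing heap_offset_in_page and heap_simd_aligned for a few sizes is a
quick way to see where a particular malloc switches behaviour.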