[Numpy-discussion] fortran array storage question
Fri Oct 26 17:14:25 CDT 2007
On 26/10/2007, Travis E. Oliphant <email@example.com> wrote:
> There is an optimization where-in the inner-loops are done over the
> dimension with the smallest stride.
> What other cache-coherent optimizations do you recommend?
That sounds like a very good first step. I'm far from an expert on
this sort of thing, but here are a few ideas at random:
* internally flattening arrays when this doesn't affect the result
* prefetching memory: in a C application I recently wrote, explicitly
prefetching data for interpolation cut my runtime by 30%. This
includes telling the processor when you're done with data so it can be
purged from the cache.
* aligning (some) arrays to 8-, 16-, 32-, or 64-byte boundaries so that
they divide nicely into cache lines
* using MMX/SSE instructions when available
* combining ufuncs so that computations can keep the CPU busy while it
waits for data to come in from main RAM (I realize that this is
properly the domain of numexpr)
* using ATLAS- or FFTW-style autotuning to determine the fastest ways
to structure computations (again more relevant for significant
expressions rather than simple ufuncs)
* reducing use of temporaries in the interest of reducing traffic to main memory
* using OpenMP to parallelize operations when this actually speeds up the calculation
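
To make the points about temporaries and flattening concrete, here is a
minimal sketch from the Python side, assuming a recent NumPy. The function
names are made up for illustration and are not part of numpy; only the
ufunc out= argument, reshape, and np.shares_memory are real numpy features.

```python
import numpy as np

def axpy_with_temporaries(a, x, y):
    # a * x allocates one temporary array; adding y allocates the result.
    return a * x + y

def axpy_in_place(a, x, y, out=None):
    # Hypothetical helper: reuse one output buffer so that no
    # intermediate array has to be allocated.
    if out is None:
        out = np.empty_like(x)
    np.multiply(x, a, out=out)   # out = a * x
    np.add(out, y, out=out)      # out = out + y, written in place
    return out

x = np.arange(6, dtype=np.float64).reshape(2, 3)
y = np.ones_like(x)
buf = np.empty_like(x)
result = axpy_in_place(2.0, x, y, out=buf)

# A C-contiguous array can be viewed as 1-D without copying, which is
# what "internally flattening" relies on for elementwise operations.
flat = x.reshape(-1)
flat_shares_memory = np.shares_memory(flat, x)
```

Whether the in-place version is actually faster depends on array size and
the machine; the sketch only shows where the extra allocations come from.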
I realize most of these are a lot of work, and some of them are
probably in numpy already. Moreover, without an expression parser
it's probably not feasible to implement some of the others. But an array language
offers the possibility that a runtime can implement all sorts of
optimizations without effort on the user's part.
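
For reference, two of the layout-related points can be inspected from
Python. The sketch below assumes a recent NumPy; innermost_axis and
aligned_empty are hypothetical helpers written for this example, not numpy
functions. The first picks the axis with the smallest stride (the one the
optimization quoted at the top would use for the inner loop), and the
second over-allocates a byte buffer and slices it so that the data starts
on a chosen boundary.

```python
import numpy as np

def innermost_axis(arr):
    # Hypothetical helper: index of the axis with the smallest
    # absolute stride, i.e. the cheapest axis to loop over innermost.
    return int(np.argmin(np.abs(arr.strides)))

def aligned_empty(n, dtype=np.float64, alignment=64):
    # Hypothetical helper: 1-D uninitialized array of n items whose
    # data pointer is a multiple of `alignment` bytes.
    dtype = np.dtype(dtype)
    nbytes = n * dtype.itemsize
    raw = np.empty(nbytes + alignment, dtype=np.uint8)
    offset = (-raw.ctypes.data) % alignment   # bytes to skip to reach the boundary
    return raw[offset:offset + nbytes].view(dtype)

c_arr = np.zeros((3, 4), dtype=np.float64)              # C (row-major) order
f_arr = np.zeros((3, 4), dtype=np.float64, order="F")   # Fortran (column-major) order

c_axis = innermost_axis(c_arr)   # last axis is contiguous in C order
f_axis = innermost_axis(f_arr)   # first axis is contiguous in Fortran order

a = aligned_empty(1000)
```

This only demonstrates the bookkeeping; it says nothing about how much
speedup alignment or loop ordering gives on a particular processor.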