[Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
Sat Mar 22 13:54:03 CDT 2008
On 22/03/2008, Travis E. Oliphant <email@example.com> wrote:
> James Philbin wrote:
> > Personally, I think that the time would be better spent optimizing
> > routines for single-threaded code and relying on BLAS and LAPACK
> > libraries to use multiple cores for more complex calculations. In
> > particular, doing some basic loop unrolling and SSE versions of the
> > ufuncs would be beneficial. I have some experience writing SSE code
> > using intrinsics and would be happy to give it a shot if people tell
> > me what functions I should focus on.
> Fabulous! This is on my Project List of todo items for NumPy. See
> http://projects.scipy.org/scipy/numpy/wiki/ProjectIdeas . I should spend
> some time refactoring the ufunc loops so that the templating does not
> get in the way of doing this on a case-by-case basis.
> 1) You should focus on the math operations: add, subtract, multiply,
> divide, and so forth.
> 2) Then for "combined operations" we should expose the functionality at
> a high level, so that somebody could write code to take advantage of it.
> It would be easiest to use intrinsics, which would then work for AMD and
> Intel on multiple compilers.
I think even heavier use of code generation would be a good idea here.
There are so many different versions of each loop, and the fastest way
to run each one is going to be different for different versions and
different platforms, that a routine that assembled the code from
chunks and picked the fastest combination for each instance might make
a big difference - this is roughly what FFTW and ATLAS do.
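To make the "pick the fastest combination" idea concrete, here is a toy sketch in pure Python. It is not how FFTW or ATLAS are implemented, and a real version for NumPy would generate and time C loop variants instead; the function names (add_simple, add_unrolled4, add_zip, pick_fastest) are made up for illustration.

```python
import timeit

# Three hypothetical variants of the same "add" inner loop.
def add_simple(a, b, out):
    for i in range(len(a)):
        out[i] = a[i] + b[i]

def add_unrolled4(a, b, out):
    n = len(a)
    main = n - n % 4
    # Body unrolled by four, followed by a remainder loop.
    for i in range(0, main, 4):
        out[i] = a[i] + b[i]
        out[i + 1] = a[i + 1] + b[i + 1]
        out[i + 2] = a[i + 2] + b[i + 2]
        out[i + 3] = a[i + 3] + b[i + 3]
    for i in range(main, n):
        out[i] = a[i] + b[i]

def add_zip(a, b, out):
    out[:] = [x + y for x, y in zip(a, b)]

CANDIDATES = [add_simple, add_unrolled4, add_zip]

def pick_fastest(candidates, n=1024, repeats=5):
    """Time every candidate on this machine and return the quickest one."""
    a = [float(i) for i in range(n)]
    b = [2.0] * n
    out = [0.0] * n
    timings = {}
    for f in candidates:
        # Best-of-N timing, to reduce noise from other processes.
        timings[f.__name__] = min(
            timeit.repeat(lambda: f(a, b, out), number=20, repeat=repeats)
        )
    best = min(candidates, key=lambda f: timings[f.__name__])
    return best, timings

best_add, timings = pick_fastest(CANDIDATES)
```

Which variant wins depends on the interpreter and the machine, which is the point: the choice is measured at build or install time rather than fixed in the source.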
There are also some optimizations to be made at a higher level that
might give these low-level optimizations more traction. For example:
from numpy.random import randn
A = randn(100*100)
A.shape = (100,100)
A*A
There's no reason the multiply ufunc couldn't flatten A and use a
single unstrided loop to do the multiplication.
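A rough user-level sketch of that idea, assuming both operands are C-contiguous and have the same shape. The helper name multiply_flat_if_contiguous is hypothetical, and this only imitates at the Python level what the ufunc machinery could do internally in C:

```python
import numpy as np

def multiply_flat_if_contiguous(a, b):
    """Elementwise multiply, going through 1-D views when that is safe."""
    if a.shape == b.shape and a.flags.c_contiguous and b.flags.c_contiguous:
        # reshape(-1) on a C-contiguous array gives a view, not a copy,
        # so the multiply sees one long unstrided run of elements.
        flat = np.multiply(a.reshape(-1), b.reshape(-1))
        return flat.reshape(a.shape)
    # Strided or mismatched inputs: fall back to the general path.
    return np.multiply(a, b)

A = np.random.randn(100 * 100)
A.shape = (100, 100)
result = multiply_flat_if_contiguous(A, A)
```

A slice such as A[:, ::2] is not contiguous, so it takes the fallback branch; only the contiguous case can be collapsed to a single loop without copying.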