[Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
Sun Mar 23 06:03:51 CDT 2008
Travis E. Oliphant wrote:
> Anne Archibald wrote:
>> On 22/03/2008, Travis E. Oliphant <firstname.lastname@example.org> wrote:
>>> James Philbin wrote:
>>> > Personally, I think that the time would be better spent optimizing
>>> > routines for single-threaded code and relying on BLAS and LAPACK
>>> > libraries to use multiple cores for more complex calculations. In
>>> > particular, doing some basic loop unrolling and SSE versions of the
>>> > ufuncs would be beneficial. I have some experience writing SSE code
>>> > using intrinsics and would be happy to give it a shot if people tell
>>> > me what functions I should focus on.
>>> Fabulous! This is on my Project List of todo items for NumPy. See
>>> http://projects.scipy.org/scipy/numpy/wiki/ProjectIdeas I should spend
>>> some time refactoring the ufunc loops so that the templating does not
>>> get in the way of doing this on a case by case basis.
>>> 1) You should focus on the math operations: add, subtract, multiply,
>>> divide, and so forth.
>>> 2) Then for "combined operations" we should expose the functionality at
>>> a high level, so that somebody could write code to take advantage of it.
>>> It would be easiest to use intrinsics which would then work for AMD,
>>> Intel, on multiple compilers.
>> I think even heavier use of code generation would be a good idea here.
>> There are so many different versions of each loop, and the fastest way
>> to run each one is going to be different for different versions and
>> different platforms, that a routine that assembled the code from
>> chunks and picked the fastest combination for each instance might make
>> a big difference - this is roughly what FFTW and ATLAS do.
>> There are also some optimizations to be made at a higher level that
>> might give these optimizations more traction. For example:
>> A = randn(100*100)
>> A.shape = (100,100)
>> There's no reason the multiply ufunc couldn't flatten A and use a
>> single unstrided loop to do the multiplication.
> Good idea, it does already do that :-) The ufunc machinery is also a
> good place for an optional thread pool.
> Perhaps we could drum up interest in a Need for Speed Sprint on NumPy
> sometime over the next few months.
> -Travis O.
I have very limited knowledge of OpenMP, but please consider this:
#define N 100000000
data = malloc(N*sizeof(double));
#pragma omp parallel for
#pragma omp parallel for
gcc -fopenmp -Wall -lm -O3 sin.c -o sinopenmp
and
gcc -fopenmp -Wall -lm -O3 sin.c -o sin
On my core2duo :
It scales very well :) (gcc-4.2). It would be so nice to see that using
OK, it is a very simple case, but numpy.sin(a) is such a case, isn't it?
Please give it a try ;)