[Numpy-discussion] NumPy re-factoring project
Thu Jun 10 16:56:56 CDT 2010
On 10.06.2010 22:28, Pauli Virtanen wrote:
> Some places where Openmp could probably help are in the inner ufunc
> loops. However, improving the memory efficiency of the data access
> pattern is another low-hanging fruit for multidimensional arrays.
Getting the intermediate array out of expressions like a + b would be
very cool. But it is not as trivial as it seems.
Possibility of controlling aliasing (cf. Fortran and C99) can help. For
example, for binary array operators we know that the output array (a
temporary intermediate) is not aliased with the two arguments.
ufuncs involving transcendental functions would benefit from OpenMP.
O(N**2) ufuncs like var and std would also benefit.
We can also use copy-in copy-out for strided array access (see below).
> I'm not sure how Openmp plays together with thread-safety; especially if
> used in outer constructs. Whether this is possible to do easily is not so
> clear, as Numpy uses e.g. iterator and loop objects, which cannot
> probably be shared between different openmp threads. So one would have to
> do some extra work to parallelize these parts manually.
OpenMP has pragmas for serializing access to code:
#pragma omp critical
// done by all threads
// serialized access (we can also name the lock)
#pragma omp atomic
a += 1 // atomic operation (e.g. nice for refcounts, instead of a big lock)
#pragma omp single
// done by one thread (random or named)
#pragma omp master
// done by the master thread only
// e.g. used when calling back to the Python C API
#pragma omp barrier
// wait for all threads here
That's about all you need to know about thread synchronization in OpenMP...
> But probably if multithreading is desired, openmp sounds like the way to go.
Anyone proficient in C or Fortran can learn OpenMP in 10 minutes, and it
is vastly easier than manual multithreading. It is now supported by the
majority of compilers.
Also about array iterators in NumPy's C base (i.e. for doing something
along an axis): we don't need those. There is a different way of coding
which leads to faster code.
1. Collect an array of pointers to each subarray (e.g. using
std::vector<dtype*> or dtype**)
2. Dispatch on the pointer array...
I used this to reimplement lfilter in scipy. It is just a lot faster
than array iterators. It also allows us to use OpenMP in a very natural
way. For multidimensional arrays I have a recursive function for
collecting the array of subarray pointers, but for 2D and 3D we can
use nested for-loops.
Another thing I did when reimplementing lfilter was "copy-in copy-out"
for strided arrays. This also helps a lot. It might seem paradoxical as
we need to iterate over the strided array(s) to make the copies. But in
practice it is faster by a factor of 2 or 3. Fortran compilers also tend
to do this optimization for strided arrays. It is even mentioned in the
Fortran standard as explicitly allowed, as it has consequences for
array pointers (they can be invalidated by function calls if a
contiguous copy is made).