[Numpy-discussion] Numpy and OpenMP

Damian Eads eads@soe.ucsc....
Sat Mar 15 20:25:59 CDT 2008


Anne,

Sure. I've found multi-threaded scientific computation to give mixed 
results. For some things, it results in very significant performance 
gains, and other things, it's not worth the trouble at all. It really 
does depend on what you're doing. But, I don't think it's fair to paint 
multithreaded programming with the same brush just because there exist 
pathologies.

Robert: what benchmarks were performed showing less than pleasing 
performance gains?

Anne Archibald wrote:
> On 15/03/2008, Damian Eads <eads@soe.ucsc.edu> wrote:
>> Robert Kern wrote:
>>  > Eric Jones tried to use multithreading to split the computation of
>>  > ufuncs across CPUs. Ultimately, the overhead of locking and unlocking
>>  > made it prohibitive for medium-sized arrays and only somewhat
>>  > disappointing improvements in performance for quite large arrays. I'm
>>  > not familiar enough with OpenMP to determine if this result would be
>>  > applicable to it. If you would like to try, we can certainly give you
>>  > pointers as to where to start.
>>
>> Perhaps I'm missing something. How is locking and synchronization an
>>  issue when each thread is writing to a mutually exclusive part of the
>>  output buffer?
> 
> The trick is to efficiently allocate these output buffers. If you
> simply give each thread 1/n th of the job, if one CPU is otherwise
> occupied it doubles your computation time. If you break the job into
> many pieces and let threads grab them, you need to worry about locking
> to keep two threads from grabbing the same piece of data.

For element-wise unary and binary array operations, there would never be 
two threads reading from the same memory at the same time. When 
performing matrix multiplication, more than two threads will access the 
same memory but this is fine as long as their accesses are read-only. 
The moment there is a chance one thread might need to write to the same 
buffer that one or more threads are reading from, use a read/write lock 
(pthreads supports this).

As far as coordinating the work for the threads, there are several 
possible approaches (this is not a complete list):

   1. assign to each of them the part of the buffer to work on 
beforehand. This assumes each thread will compute at the same rate and 
will finish the same chunk roughly in the same amount of time. This is 
not always a valid assumption.

   2. assign smaller chunks, leaving a large amount of unassigned work. 
As threads complete computation of a chunk, assign them another chunk. 
This requires some memory to keep track of the chunks assigned and 
unassigned. Since it is possible for multiple threads to try to access 
(with at least one modifying thread) this chunk assignment structure at 
the same time, you need synchronization. In some cases, the overhead for 
doing this synchronization is minimal.

   3. use approach #2 but assign chunk sizes of random sizes to reduce 
contention between threads trying to access the chunk assignment 
structure at the same time.

   4. for very large jobs, have a chunk assignment server. Some of my 
experiments take several weeks and are spread across 64 processors (8 
machines, 8 processors per machine). Individual units of computation 
take anywhere from 30 minutes to 8 hours. The cost of asking the chunk 
assignment server for a new chunk are minimal relative to the amount of 
time it takes to compute on the chunk. By not assigning all the 
computation up front in the beginning, most processors are working 
nearly all the time. It's only during the last day or two of the 
experiment, do there exist processors with nothing to do.

> Plus,
> depending on where things are in memory you can kill performance by
> abusing the caches (maintaining cache consistency across CPUs can be a
> challenge). Plus a certain amount of numpy code depends on order of
> evaluation:
> 
> a[:-1] = 2*a[1:]

Yes, but there are many, many instances when the order of evaluation in 
an array is sequential. I'm not advocating that numpy tool be devised to 
handle the parallelization of arbitrary computation, just common kinds 
of computation where performance gains might be realized.

> Correctly handling all this can take a lot of overhead, and require a
> lot of knowledge about hardware. OpenMP tries to take care of some of
> this in a way that's easy on the programmer.
> 
> To answer the OP's question, there is a relatively small number of C
> inner loops that could be marked up with OpenMP #pragmas to cover most
> matrix operations. Matrix linear algebra is a separate question, since
> numpy/scipy prefers to use optimized third-party libraries - in these
> cases one would need to use parallel linear algebra libraries (which
> do exist, I think, and are plug-compatible).  So parallelizing numpy is
> probably feasible, and probably not too difficult, and would be
> valuable.

Yes, but there is a limit to the parallelization that can be achieved 
with vanilla numpy. numpy evaluates Python expressions, one at a time; 
thus, expressions like

       sqrt(0.5 * B * C + D * (E + F))

are not well-parallelized and they waste scratch space. One workaround 
is having the __add__ and __mul__ functions return place-holder objects, 
instead of doing the actual computation.

      A = sqrt(0.5 * B * C + D * (E + F)).compute()

Then invoke .compute() on the outermost placeholder object to perform 
the computation in a parallelized fashion. What .compute does is a big 
open question.

One possibility is to generate C code and run it. For example, the 
Python expression above might result in the following C code:

   for (i = chunk_start; i < chunk_end; i++) {
      A[i] = sqrt(0.5 * B[i] * C[i] * D[i] * (E[i] + F[i]));
   }

Each thread is given a different value for chunk_start and chunk_end. Of 
course, it's desired that each of the input matrices B, C, D, E, F are 
contiguous for good use of the cache. There are many possibilities about 
what's done with the placeholder objects.

The issue of thread contention on chunk assignment data structures is 
valid. In some cases, the overhead may be minimal. In other cases, there 
are strategies one can employ to reduce contention.

Damian


More information about the Numpy-discussion mailing list