[Numpy-discussion] Numpy and OpenMP
Charles R Harris
Mon Mar 17 15:37:50 CDT 2008
On Mon, Mar 17, 2008 at 1:59 PM, Gnata Xavier <firstname.lastname@example.org> wrote:
> Francesc Altet wrote:
> > On Monday 17 March 2008, Christopher Barker wrote:
> >>> > Plus a certain amount of numpy code depends on order of
> >>> > evaluation:
> >>> >
> >>> > a[:-1] = 2*a[1:]
> >> I'm confused here. My understanding of how it now works is that the
> >> above translates to:
> >> 1) create a new array (call it temp1) from a[1:], which shares a's
> >> data block.
> >> 2) create a temp2 array by multiplying each of the elements in temp1
> >> by 2, and writing them into a new array, with a new data block.
> >> 3) copy that temporary array into a[:-1]
> >> Why couldn't step (2) be parallelized? Why isn't it already, with
> >> BLAS? Doesn't BLAS have such simple routines?
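The three steps described above can be sketched explicitly. This is only an illustration, not code from the thread: `temp1` and `temp2` are the names used in the message, and the back-to-front loop over `b` is an added, hypothetical example of an in-place evaluation order that would give a different answer, which is the sense in which the result can depend on order of evaluation.

```python
import numpy as np

a = np.arange(6, dtype=np.int64)          # [0 1 2 3 4 5]

# Step 1: a[1:] is a view that shares a's data block.
temp1 = a[1:]
assert np.shares_memory(temp1, a)

# Step 2: the multiplication writes into a new array with its own data block.
temp2 = 2 * temp1
assert not np.shares_memory(temp2, a)

# Step 3: copy the temporary into a[:-1].
a[:-1] = temp2

# Hypothetical in-place variant without a temporary, run back to front:
# each iteration reads an element that was already overwritten.
b = np.arange(6, dtype=np.int64)
for i in range(len(b) - 2, -1, -1):
    b[i] = 2 * b[i + 1]

print(a.tolist())   # result with the temporary
print(b.tolist())   # result of the back-to-front in-place loop
```

With the temporary, `a` ends up as `[2, 4, 6, 8, 10, 5]`; the back-to-front loop leaves `b` as `[160, 80, 40, 20, 10, 5]`. Step 2 on its own has no such dependence, since every output element is computed from a distinct input element.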
> > Probably yes, but the problem is that this kind of operation, namely
> > vector-to-vector (usually found in the BLAS1 subset of BLAS), is
> > normally memory-bound, so you gain little advantage from using
> > BLAS, especially on modern processors, where the gap between
> > CPU throughput and memory bandwidth is quite large (and increasing).
> > On modern machines, BLAS is more worthwhile for vector-matrix
> > (BLAS2) computations, and definitely for matrix-matrix (BLAS3) ones
> > (where the opportunities for cache reuse are greatest), where the
> > speedups can be very good.
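A rough back-of-the-envelope way to see this is to count floating-point operations per byte moved for one representative routine from each BLAS level. The counts below are the usual idealized estimates (double precision, every operand read or written exactly once, no cache effects), not measurements, and the helper names are hypothetical.

```python
def flops_per_byte(flops, elements_moved, itemsize=8):
    """Idealized arithmetic intensity for double-precision data."""
    return flops / (elements_moved * itemsize)

def axpy_intensity(n):
    # BLAS1, y = alpha*x + y: 2n flops; read x, read y, write y.
    return flops_per_byte(2 * n, 3 * n)

def gemv_intensity(n):
    # BLAS2, y = A @ x with A of shape (n, n): 2n^2 flops;
    # read A (n^2 elements), read x (n), write y (n).
    return flops_per_byte(2 * n * n, n * n + 2 * n)

def gemm_intensity(n):
    # BLAS3, C = A @ B with all matrices (n, n): 2n^3 flops;
    # read A and B, write C (3n^2 elements).
    return flops_per_byte(2 * n ** 3, 3 * n * n)

for n in (100, 1000):
    print(n, axpy_intensity(n), gemv_intensity(n), gemm_intensity(n))
```

Under these assumptions the BLAS1 and BLAS2 figures stay roughly constant as `n` grows (about 0.08 and 0.25 flops per byte), while the BLAS3 figure grows linearly with `n`, which is where cache reuse can pay off.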
> >> Also, maybe numexpr could benefit from this?
> > Maybe, but unfortunately it wouldn't be able to achieve high speedups.
> > Right now, numexpr is focused on accelerating mainly vector-vector
> > operations (or matrix-matrix, but element-wise, much like NumPy, so
> > that the cache cannot be reused), with some smart optimizations for
> > strided and unaligned arrays (in this scenario, it can be 2x or 3x
> > faster than NumPy, even for very simple operations like 'a+b').
> > In a similar way, OpenMP (or any other parallel paradigm) will
> > generally only be useful when you have to deal with lots of data and your
> > algorithm has the opportunity to structure it so that small
> > portions of it can be reused many times.
> > Cheers,
> Well, linear algebra is another topic.
> What I can see from IDL (for instance) is that it provides the user
> with a TOTAL function which takes advantage of several CPUs when the
> number of elements is large. It also provides a very simple way to set a
> max number of threads.
> I really really would like to see something like that in numpy (just to
> be able to tell someone "switch to numpy, it is free and you will get
> exactly the same"). For now, I have a problem when they ask for parallel
> functions like TOTAL.
> For now, we can do that using C inline threaded code but it is *complex*
> and 2000x2000 images are now common. It is not a corner case any more.
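For illustration only, the kind of interface being asked for might look like the sketch below. It is neither IDL's TOTAL nor a NumPy API: `threaded_total`, `max_threads`, and `min_size` are hypothetical names, and it assumes that NumPy's `sum` releases the GIL for plain numeric dtypes so that the threads can actually overlap. Since a sum is memory-bound, any speedup is likely to be limited by memory bandwidth.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def threaded_total(arr, max_threads=4, min_size=100_000):
    """Sum all elements of arr, splitting the work across threads
    only when the array is large enough to be worth it."""
    flat = np.ravel(arr)
    if max_threads <= 1 or flat.size < min_size:
        return flat.sum()
    # One contiguous chunk per thread; these are views, not copies.
    chunks = np.array_split(flat, max_threads)
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        partial_sums = list(pool.map(np.sum, chunks))
    # Combine the per-thread partial sums.
    return sum(partial_sums)

image = np.ones((2000, 2000), dtype=np.float64)
total = threaded_total(image, max_threads=4)
print(total)
```

Because the partial sums are combined in a different order than a single `np.sum` call would use, floating-point results may differ in the last bits for general data.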
Image processing may be special in that in many cases it is almost
embarrassingly parallel. Perhaps some special libraries for that sort of
application could be put together and just bits of C code be run on
different processors. Not that I know much about parallel processing, but
that would be my first take.
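A minimal sketch of that idea, in Python rather than C and using threads rather than separate processes: the image is cut into horizontal bands and each band is processed independently. `process_in_bands` and `n_workers` are hypothetical names, and the sketch assumes a purely per-pixel operation; filters that need neighbouring pixels would require overlapping bands.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def process_in_bands(image, func, n_workers=4):
    """Apply a per-pixel function to horizontal bands of image independently."""
    out = np.empty_like(image)
    # Row boundaries of the bands, e.g. [0, 500, 1000, 1500, 2000].
    edges = np.linspace(0, image.shape[0], n_workers + 1).astype(int)

    def work(bounds):
        lo, hi = bounds
        # Each band writes only to its own rows, so no locking is needed.
        out[lo:hi] = func(image[lo:hi])

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(work, zip(edges[:-1], edges[1:])))
    return out

image = np.arange(2000 * 2000, dtype=np.float64).reshape(2000, 2000)
result = process_in_bands(image, np.sqrt, n_workers=4)
```

Unlike a reduction, there is no combine step here: the bands never interact, which is what makes this case almost embarrassingly parallel.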