[Numpy-discussion] Numpy and OpenMP

Gnata Xavier xavier.gnata@gmail....
Mon Mar 17 18:03:22 CDT 2008


Charles R Harris wrote:
>
>
> On Mon, Mar 17, 2008 at 1:59 PM, Gnata Xavier <xavier.gnata@gmail.com 
> <mailto:xavier.gnata@gmail.com>> wrote:
>
>     Francesc Altet wrote:
>     > On Monday 17 March 2008, Christopher Barker wrote:
>     >
>     >>>  > Plus a certain amount of numpy code depends on order of
>     >>>  > evaluation:
>     >>>  >
>     >>>  > a[:-1] = 2*a[1:]
>     >>>
>     >> I'm confused here. My understanding of how it now works is that the
>     >> above translates to:
>     >>
>     >> 1) create a new array (call it temp1) from a[1:], which shares a's
>     >> data block.
>     >> 2) create a temp2 array by multiplying each element of temp1 by 2,
>     >> writing the results into a new array with a new data block.
>     >> 3) copy that temporary array into a[:-1].
>     >>
>     >> Why couldn't step (2) be parallelized? Why isn't it already, with
>     >> BLAS? Doesn't BLAS have such simple routines?
>     >>
>     >
>     > Probably yes, but the problem is that this kind of operation,
>     > namely vector-to-vector (usually found in the BLAS1 subset of
>     > BLAS), is normally memory-bound, so you can take little advantage
>     > of BLAS, especially on modern processors, where the gap between
>     > CPU throughput and memory bandwidth is quite high (and increasing).
>     > On modern machines, the use of BLAS is more interesting for
>     > vector-matrix (BLAS2) computations, and definitely for
>     > matrix-matrix (BLAS3) ones (which is where the opportunities for
>     > cache reuse are higher), where the speedups can really be very
>     > good.
>     >
>     >
>     >> Also, maybe numexpr could benefit from this?
>     >>
>     >
>     > Maybe, but unfortunately it wouldn't be able to achieve high
>     > speedups. Right now, numexpr is focused on accelerating mainly
>     > vector-vector operations (or matrix-matrix, but element-wise, much
>     > like NumPy, so the cache cannot be reused), with some smart
>     > optimizations for strided and unaligned arrays (in that scenario
>     > it can be 2x or 3x faster than NumPy, even for very simple
>     > operations like 'a+b').
>     >
>     > In a similar way, OpenMP (or whatever parallel paradigm) will
>     > generally only be useful when you have to deal with lots of data
>     > and your algorithm can be structured so that small portions of it
>     > are reused many times.
>     >
>     > Cheers,
>     >
>     >
>
>     Well, linear algebra is another topic.
>
>     What I can see from IDL (for instance) is that it provides the user
>     with a TOTAL function which takes advantage of several CPUs when the
>     number of elements is large. It also provides a very simple way to
>     set a maximum number of threads.
>
>     I really really would like to see something like that in numpy (just
>     to be able to tell someone "switch to numpy, it is free and you will
>     get exactly the same"). For now, I have a problem when they ask for
>     parallel functions like TOTAL.
>
>     For now, we can do that using inline threaded C code, but it is
>     *complex*, and 2000x2000 images are now common. It is not a corner
>     case any more.
>
>
> Image processing may be a special case, in that it is often almost 
> embarrassingly parallel.

Yes, but who likes to do that?
One trivial case: divide an image by its mean:
Compute the mean of the image
Divide the image by its mean

It should be 3 small lines of code, no more.
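
Something like this minimal NumPy sketch is all it should take (the
random array here is just a stand-in for a real image); the catch is
that both the mean() reduction and the in-place division run on a
single core:

    import numpy as np

    img = np.random.rand(2000, 2000)  # stand-in for a real 2000x2000 image
    img /= img.mean()                 # normalise by the mean, in place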

Using the "embarrassingly parallel paradigm" to compute that, I would 
have to store the partial results and then run another exe to read then. 
Ugly. ugly but very common in the proto phases. Or it can be pipes or 
sockets or...wait just write in C/MPI if you want to do that. Tunning 
this C/MPI code you will get the best performances.
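
Just to make the "store the partial results" complaint concrete, here is
a minimal sketch of the do-it-yourself route using Python's
multiprocessing module (available as the third-party "processing"
package before Python 2.6); the helper names parallel_sum, partial_sum
and nworkers are only illustrative. Every chunk is pickled and copied
into a worker process, which is exactly the kind of overhead and
boilerplate I would rather not write for something as simple as a sum:

    import numpy as np
    from multiprocessing import Pool

    def partial_sum(chunk):
        # each worker sums its own chunk; the chunk is pickled and
        # copied into the worker process, which is part of the cost
        return chunk.sum()

    def parallel_sum(a, nworkers=4):
        # split along the first axis and farm the pieces out to workers
        chunks = np.array_split(a, nworkers)
        pool = Pool(nworkers)
        try:
            partials = pool.map(partial_sum, chunks)
        finally:
            pool.close()
            pool.join()
        # combine the partial results in the parent process
        return sum(partials)

    if __name__ == '__main__':
        img = np.random.rand(2000, 2000)
        print(parallel_sum(img), img.sum())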

Ok fine. Fine, but in a few months quad-cores will be "cheap". Using 
numpy, I know I never get the best performance on a multicore machine, 
and I do not care. I just get the best 
performance/time_needed_to_code_that ratio, by far, and that is why IMHO 
numpy is great :). The problem is that on a multicore machine this ratio 
is not that high, because there is no way to perform s = sum(A) in a 
"maybe-sub-optimal but not monocore" way. Sub-linear scaling (let's say 
real-life scaling) will always be better than nothing.
 

Xavier

> Perhaps some special libraries for that sort of application could be 
> put together and just bits of c code be run on different processors. 
> Not that I know much about parallel processing, but that would be my 
> first take.
>
> Chuck
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Numpy-discussion mailing list
> Numpy-discussion@scipy.org
> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>   


