[Numpy-discussion] Numpy and OpenMP
Mon Mar 17 14:59:08 CDT 2008
Francesc Altet wrote:
> A Monday 17 March 2008, Christopher Barker escrigué:
>>> > Plus a certain amount of numpy code depends on order of
>>> > evaluation:
>>> > a[:-1] = 2*a[1:]
>> I'm confused here. My understanding of how it now works is that the
>> above translates to:
>> 1) create a new array (call it temp1) from a[1:], which shares a's
>> data block.
>> 2) create a temp2 array by multiplying 2 times each of the elements
>> in temp1, and writing them into a new array, with a new data block 3)
>> copy that temporary array into a[:-1]
>> Why couldn't step (2) be parallelized? Why isn't it already with,
>> BLAS? Doesn't BLAS must have such simple routines?
> Probably yes, but the problem is that this kind of operations, namely,
> vector-to-vector (usually found in the BLAS1 subset of BLAS), are
> normally memory-bounded, so you can take little avantage from using
> BLAS, most specially in modern processors, where the gap between the
> CPU throughput and the memory bandwith is quite high (and increasing).
> In modern machines, the use of BLAS is more interesting in vector-matrix
> (BLAS2) computations, but definitely is in matrix-matrix (BLAS3) ones
> (which is where the oportunities for cache reuse is higher) where the
> speedups can really be very good.
>> Also, maybe numexpr could benefit from this?
> Maybe, but unfortunately it wouldn't be able to achieve high speedups.
> Right now, numexpr is focused in accelerating mainly vector-vector
> operations (or matrix-matrix, but element-wise, much like NumPy, so
> that the cache cannot be reused), with some smart optimizations for
> strided and unaligned arrays (in this scenario, it can be 2x or 3x
> faster than NumPy, even for very simple operations like 'a+b').
> In a similar way, OpenMP (or whatever parallel paradigm) will only
> generally be useful when you have to deal with lots of data, and your
> algorithm can have the oportunity to structure it so that small
> portions of them can be reused many times.
Well, linear alagera is another topic.
What I can see from IDL (for innstance) is that it provides the user
with a TOTAL function which take avantage of several CPU when the
number of elements is large. It also provides a very simple way to set a
max number of threads.
I really really would like to see something like that in numpy (just to
be able to tell somone "switch to numpy it is free and you will get
exactly the same"). For now, I have a problem when they ask for //
functions like TOTAL.
For now, we can do that using C inline threaded code but it is *complex*
and 2000x2000 images are now common. It is not a corner case any more.
More information about the Numpy-discussion