[Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

Gnata Xavier xavier.gnata@gmail....
Mon Mar 24 08:31:33 CDT 2008

David Cournapeau wrote:
> Gnata Xavier wrote:
>> Ok I will try to see what I can do but it is sure that we do need the 
>> plug-in system first (read "before the threads in the numpy release"). 
>> During the development of 1.1, I will try to find some time to understand 
>> where I should put some pragmas into the ufuncs using a very conservative 
>> approach. Anyone with some OpenMP knowledge is welcome because I'm 
>> not an OpenMP expert but only an OpenMP user in my C/C++ codes.
> Note that the plug-in idea is just my own idea, it is not something 
> agreed by anyone else. So maybe it won't be done for numpy 1.1, or at 
> all. It depends on the main maintainers of numpy.
>> and the results :
>> 10000000     80      10.308471       30.007250
>> 1000000      160     1.902563        5.800172
>> 100000       320     0.543008        1.123274
>> 10000        640     0.206823        0.223031
>> 1000         1280    0.088898        0.044268
>> 100          2560    0.150429        0.008880
>> 10           5120    0.289589        0.002084
>>  ---> On this machine, we should start to use threads *in this testcase* 
>> iff size>=10000 (a 100*100 image is a very very small one :))
> Maybe openMP can be more clever, but it tends to show that openMP, when 
> used naively, can *not* decide how many threads to use. That's really 
> the core problem: again, I don't know much about openMP, but almost any 
> project using multi-thread/multi-process and not being embarrassingly 
> parallel has the problem that it makes things much slower for many cases 
> where thread creation/management and co have a lot of overhead 
> proportionally to the computation. The problem is to determine the N, 
> dynamically, or in a way which works well for most cases. OpenMP was 
> created for HPC, where you have very large data; it is not so obvious to 
> me that it is adapted to numpy which has to be much more flexible. Being 
> fast on a given problem is easy; being fast on a whole range, that's 
> another story: the problem really is to be as fast as before on small 
> arrays.
> The fact that matlab, while having many more resources than us, took 
> years to do it, makes me extremely skeptical about the efficient use of 
> multi-threading without real benchmarks for numpy. They have a dedicated 
> team, who developed a JIT for matlab, which "inserts" multi-thread code 
> on the fly (for m files, not when you are in the interpreter), and who 
> use multi-thread blas/lapack (which is already available in numpy 
> depending on the blas/lapack you are using).
> But again, and that's really the only thing I have to say: prove me wrong :)
> David
I can't :) I can't for a simple reason. Quoting the IDL documentation:

"There are instances when allowing IDL to use its default thread pool 
settings can lead to undesired results. In some instances, a 
multithreaded implementation using the thread pool may actually take 
longer to complete a given job than a single-threaded implementation."

"To prevent the use of the thread pool for computations that involve too 
few data elements, IDL supports a minimum threshold value for thread 
pool computations. The minimum threshold value is contained in the 
TPOOL_MIN_ELTS field of the !CPU system variable. See the following 
sections for details on modifying this value."

At work, I can see people switching from IDL to numpy/scipy/pylab.
They are very happy with numpy but they would like to find this "thread pool 
capability" in numpy.
All these guys come from C (or from fortran), often from C/fortran MPI 
or OpenMP.
They know which parts of a code should be threaded and which parts should 
not. As a result, they are very happy with the IDL thread pool.

I'm just thinking about how to translate that into numpy.
Now I have to take a close look at the ufunc code and figure out how 
to add -fopenmp.
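
As a first rough idea (only a sketch, nothing taken from the numpy sources): 
OpenMP has an if() clause on the parallel directive, which looks like the 
closest thing to IDL's TPOOL_MIN_ELTS. The names tpool_min_elts and 
add_doubles below are made up for illustration, and the default of 10000 is 
only the crossover seen on this machine in the testcase above; it would 
have to be tuned per machine and per operation.

```c
#include <stddef.h>

/* Hypothetical knob named after IDL's TPOOL_MIN_ELTS; not a numpy API.
   10000 is an assumption taken from one benchmark on one machine. */
size_t tpool_min_elts = 10000;

/* out[i] = a[i] + b[i] for i in [0, n). */
void add_doubles(const double *a, const double *b, double *out, size_t n)
{
    long i;
    long len = (long)n;

    /* The if() clause asks OpenMP to run the loop on a single thread
       when the array is shorter than the threshold. */
    #pragma omp parallel for if(n >= tpool_min_elts)
    for (i = 0; i < len; i++)
        out[i] = a[i] + b[i];
}
```

Below the threshold the loop should run serially, so small arrays would 
not pay for starting a team of threads. Whether the remaining overhead is 
really negligible on small arrays would still have to be benchmarked.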

From a very pragmatic point of view:

What is the best/simplest way to use inline C or whatever to do that:

"I have a large array A and, at some points in my nice numpy code, I 
would like to compute, let's say, the threaded sum or the sine of this 
array. Assuming that I know how to write it in C/OpenMP code." (The 
background is "I really know that in my case it is much faster... and I 
asked my boss for a multi-core machine ;)").
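
To be concrete, the C/OpenMP side I have in mind is roughly the following. 
This is only a sketch: the function name threaded_sum is made up, and it 
assumes the array data is a contiguous buffer of doubles.

```c
#include <stddef.h>

/* Sum of a[0..n-1], assuming a contiguous buffer of doubles. */
double threaded_sum(const double *a, size_t n)
{
    double total = 0.0;
    long i;
    long len = (long)n;

    /* reduction(+:total) gives each thread a private partial sum and
       adds the partial sums together at the end of the loop. */
    #pragma omp parallel for reduction(+:total)
    for (i = 0; i < len; i++)
        total += a[i];

    return total;
}
```

The sine case would have the same shape, with an output buffer and 
out[i] = sin(a[i]) as the loop body. Without -fopenmp the pragma is simply 
ignored and the loop runs on one thread. With threads, the additions happen 
in a different order, so the last digits of the result may differ slightly 
from the serial sum.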

