[Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
Mon Mar 24 08:31:33 CDT 2008
David Cournapeau wrote:
> Gnata Xavier wrote:
>> Ok, I will try to see what I can do, but it is sure that we do need the
>> plug-in system first (read "before the threads in the numpy release").
>> During the devel of 1.1, I will try to find some time to understand
>> where I should put some pragmas into the ufuncs using a very conservative
>> approach. Any people with some OpenMP knowledge are welcome, because I'm
>> not an OpenMP expert but only an OpenMP user in my C/C++ codes.
> Note that the plug-in idea is just my own idea, it is not something
> agreed by anyone else. So maybe it won't be done for numpy 1.1, or at
> all. It depends on the main maintainers of numpy.
>> and the results (size, loop count, threaded time [s], serial time [s]):
>>   10000000     80  10.308471  30.007250
>>    1000000    160   1.902563   5.800172
>>     100000    320   0.543008   1.123274
>>      10000    640   0.206823   0.223031
>>       1000   1280   0.088898   0.044268
>>        100   2560   0.150429   0.008880
>>         10   5120   0.289589   0.002084
>> ---> On this machine, we should start to use threads *in this testcase*
>> iff size >= 10000 (a 100*100 image is a very, very small one :))
> Maybe OpenMP can be more clever, but it tends to show that OpenMP, when
> used naively, can *not* decide how many threads to use. That's really
> the core problem: again, I don't know much about OpenMP, but almost any
> project using multi-thread/multi-process and not being embarrassingly
> parallel has the problem that it makes things much slower in many cases,
> because thread creation/management and co have a lot of overhead
> proportionally to the computation. The problem is to determine the N,
> dynamically, or in a way which works well for most cases. OpenMP was
> created for HPC, where you have very large data; it is not so obvious to
> me that it is adapted to numpy, which has to be much more flexible. Being
> fast on a given problem is easy; being fast on a whole range, that's
> another story: the problem really is to stay as fast as before on small
> arrays while getting faster on large ones.
> The fact that matlab, while having much more resources than us, took
> years to do it makes me extremely skeptical about the efficient use of
> multi-threading in numpy without real benchmarks. They have a dedicated
> team, who developed a JIT for matlab which "inserts" multi-threaded code
> on the fly (for m files, not when you are in the interpreter), and who
> use multi-threaded blas/lapack (which is already available in numpy,
> depending on the blas/lapack you are using).
> But again, and that's really the only thing I have to say: prove me wrong :)
I can't :) I can't, for a simple reason. Quoting the IDL documentation:
"There are instances when allowing IDL to use its default thread pool
settings can lead to undesired results. In some instances, a
multithreaded implementation using the thread pool may actually take
longer to complete a given job than a single-threaded implementation."
"To prevent the use of the thread pool for computations that involve too
few data elements, IDL supports a minimum threshold value for thread
pool computations. The minimum threshold value is contained in the
TPOOL_MIN_ELTS field of the !CPU system variable. See the following
sections for details on modifying this value."
At work, I can see people switching from IDL to numpy/scipy/pylab.
They are very happy with numpy, but they would like to find this "thread
pool capability" in numpy as well.
All these guys come from C (or from Fortran), often from C/Fortran MPI
codes. They know which parts of a code should be threaded and which parts
should not. As a result, they are very happy with the IDL thread pool,
and I'm just thinking about how to translate that into numpy.
Now I have to take a close look at the ufunc code and figure out how
to add -fopenmp.
From a very pragmatic point of view:
What is the best/simplest way to use inline C (or whatever) to do that:
"I have a large array A and, at some points of my nice numpy code, I
would like to compute, let's say, the threaded sum or the sine of this
array, assuming that I know how to write it in C/OpenMP code." (The
background is "I really know that in my case it is much faster... and I
asked my boss for a multi-core machine ;)").