[Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
David Cournapeau
david@ar.media.kyoto-u.ac...
Sun Mar 23 22:37:27 CDT 2008
Gnata Xavier wrote:
> Ok, I will try to see what I can do, but it is sure that we do need the
> plug-in system first (read: "before the threads in the numpy release").
> During the devel of 1.1, I will try to find some time to understand
> where I should put some pragmas into the ufuncs, using a very conservative
> approach. Any people with some OpenMP knowledge are welcome, because I'm
> not an OpenMP expert, but only an OpenMP user in my C/C++ codes.
Note that the plug-in idea is just my own; it has not been agreed on by
anyone else. So maybe it won't be done for numpy 1.1, or at all. It
depends on the main maintainers of numpy.
>
>
> and the results :
> 10000000    80   10.308471   30.007250
>  1000000   160    1.902563    5.800172
>   100000   320    0.543008    1.123274
>    10000   640    0.206823    0.223031
>     1000  1280    0.088898    0.044268
>      100  2560    0.150429    0.008880
>       10  5120    0.289589    0.002084
>
> ---> On this machine, we should start to use threads *in this test case*
> iff size >= 10000 (a 100x100 image is a very, very small one :))
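The crossover the numbers above suggest can actually be expressed directly in OpenMP via its `if` clause, which falls back to a serial loop below a size threshold. A sketch, where the 10000 cutoff is simply the crossover observed in this benchmark, not a general recommendation:

```c
#include <stddef.h>

/* Only spawn threads when the array is large enough for the parallel
   speedup to beat the thread-management overhead.  The 10000 cutoff is
   the crossover measured in the benchmark above; on other machines it
   would have to be measured again (or tuned dynamically). */
#define OMP_SIZE_CUTOFF 10000

static double sum_double(const double *a, size_t n)
{
    double total = 0.0;
    long i;
    #pragma omp parallel for reduction(+:total) if (n >= OMP_SIZE_CUTOFF)
    for (i = 0; i < (long)n; i++)
        total += a[i];
    return total;
}
```

This only addresses choosing *whether* to thread, not *how many* threads to use, and the right cutoff still has to come from benchmarks on each machine.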
Maybe OpenMP can be smarter, but this tends to show that OpenMP, when
used naively, can *not* decide how many threads to use. That's really
the core problem: again, I don't know much about OpenMP, but almost any
project using multi-threading/multi-processing on problems that are not
embarrassingly parallel hits the same issue: it makes things much slower
in the many cases where thread creation/management and the like have a
lot of overhead relative to the computation. The problem is to determine
the threshold N, dynamically, or in a way which works well for most
cases. OpenMP was created for HPC, where you have very large data; it is
not so obvious to me that it is suited to numpy, which has to be much
more flexible. Being fast on a given problem is easy; being fast on a
whole range is another story: the problem really is to be as fast as
before on small arrays.
The fact that Matlab, while having far more resources than we do, took
years to do it makes me extremely skeptical about efficient use of
multi-threading in numpy without real benchmarks. They have a dedicated
team, which developed a JIT for Matlab that inserts multi-threaded code
on the fly (for m-files, not when you are in the interpreter), and which
uses multi-threaded BLAS/LAPACK (something already available in numpy,
depending on the BLAS/LAPACK you are using).
But again, and that's really the only thing I have to say: prove me wrong :)
David