[Numpy-discussion] testing with amd libm/acml
Dag Sverre Seljebotn
d.s.seljebotn@astro.uio...
Thu Nov 8 12:55:20 CST 2012
On 11/08/2012 06:59 PM, Francesc Alted wrote:
> On 11/8/12 6:38 PM, Dag Sverre Seljebotn wrote:
>> On 11/08/2012 06:06 PM, Francesc Alted wrote:
>>> On 11/8/12 1:41 PM, Dag Sverre Seljebotn wrote:
>>>> On 11/07/2012 08:41 PM, Neal Becker wrote:
>>>>> Would you expect numexpr without MKL to give a significant boost?
>>>> If you need higher performance than what numexpr can give without using
>>>> MKL, you could look at code such as this:
>>>>
>>>> https://github.com/herumi/fmath/blob/master/fmath.hpp#L480
>>> Hey, that's cool. I was a bit disappointed not finding this sort of
>>> work in open space. It seems that this lacks threading support, but
>>> that should be easy to implement by using OpenMP directives.
>> IMO this is the wrong place to introduce threading; each thread should
>> call expd_v on its chunks. (Which I think is how you said numexpr
>> currently uses VML anyway.)
>
> Oh sure, but then you need a blocked engine for performing the
> computations too. And yes, by default numexpr uses its own threading
I just meant that you can use a chunked OpenMP for-loop wherever your
code calls expd_v. A "five-line blocked engine", if you like :-)
IMO that's the right location, since entering/exiting OpenMP blocks takes
some time.
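
A minimal sketch of such a chunked loop, under some assumptions: expd_v is
taken to work in place on a double array of a given length, and it is
replaced here by a std::exp stand-in so that the snippet compiles without
fmath. The names exp_chunked and expd_v_standin and the chunk parameter are
made up for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Stand-in for fmath's expd_v so the sketch compiles on its own.
// Assumption: the real routine works in place on n doubles.
static void expd_v_standin(double* x, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) x[i] = std::exp(x[i]);
}

// Apply exp in place, one chunk per loop iteration.
// chunk is a hypothetical tuning parameter in elements; it must be > 0.
void exp_chunked(double* x, std::size_t n, std::size_t chunk = 4096) {
    const long nchunks = static_cast<long>((n + chunk - 1) / chunk);
    // One parallel section for the whole array; each thread gets whole chunks.
    // Without OpenMP enabled the pragma is ignored and the loop runs serially.
    #pragma omp parallel for schedule(static)
    for (long c = 0; c < nchunks; ++c) {
        const std::size_t start = static_cast<std::size_t>(c) * chunk;
        const std::size_t len = std::min(chunk, n - start);
        expd_v_standin(x + start, len);
    }
}
```

The parallel section is entered once per array rather than once per chunk,
so the fixed cost of entering and leaving it is paid only once.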
> code rather than the existing one in VML (but that can be changed by
> playing with set_num_threads/set_vml_num_threads). It always struck
> me as a little strange that the internal threading in numexpr was more
> efficient than the VML one, but I suppose this is because the latter is
> more optimized to deal with large blocks instead of those of medium size
> (4K) in numexpr.
I don't know enough about numexpr to understand this :-)
I guess I just don't see the motivation to use VML threading or why it
should be faster? If you pass a single 4K block to a threaded VML call
then I could easily see lots of performance problems: a)
starting/stopping threads or signalling the threads of a pool is a
constant overhead per "parallel section", b) unless you're very careful
to only have VML touch the data, and VML always schedules elements in
the exact same way, you're going to have the cache lines of that 4K
block shuffled between L1 caches of different cores for different
operations...
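
To illustrate point b), here is a rough sketch, not a description of how
numexpr or VML actually work: if each thread runs all operations on its own
chunk back to back, the chunk stays with one core between operations and
there is only one parallel section. The kernels scale_chunk and exp_chunk
and the function scaled_exp_fused are hypothetical names.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Hypothetical per-chunk kernels; any single-threaded vector routine fits here.
static void scale_chunk(double* x, std::size_t n, double a) {
    for (std::size_t i = 0; i < n; ++i) x[i] *= a;
}
static void exp_chunk(double* x, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) x[i] = std::exp(x[i]);
}

// Computes x[i] = exp(a * x[i]) in place; chunk must be > 0.
// Both operations run on a chunk within the same loop iteration, hence on
// the same thread, so the chunk is not handed between cores in between.
void scaled_exp_fused(double* x, std::size_t n, double a,
                      std::size_t chunk = 4096) {
    const long nchunks = static_cast<long>((n + chunk - 1) / chunk);
    #pragma omp parallel for schedule(static)
    for (long c = 0; c < nchunks; ++c) {
        const std::size_t start = static_cast<std::size_t>(c) * chunk;
        const std::size_t len = std::min(chunk, n - start);
        scale_chunk(x + start, len, a);  // first operation on this chunk
        exp_chunk(x + start, len);       // second operation, same thread
    }
}
```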
As I said, I'm mostly ignorant about how numexpr works, that's probably
showing :-)
Dag Sverre