[Numpy-discussion] testing with amd libm/acml

Dag Sverre Seljebotn d.s.seljebotn@astro.uio...
Thu Nov 8 12:55:20 CST 2012


On 11/08/2012 06:59 PM, Francesc Alted wrote:
> On 11/8/12 6:38 PM, Dag Sverre Seljebotn wrote:
>> On 11/08/2012 06:06 PM, Francesc Alted wrote:
>>> On 11/8/12 1:41 PM, Dag Sverre Seljebotn wrote:
>>>> On 11/07/2012 08:41 PM, Neal Becker wrote:
>>>>> Would you expect numexpr without MKL to give a significant boost?
>>>> If you need higher performance than what numexpr can give without using
>>>> MKL, you could look at code such as this:
>>>>
>>>> https://github.com/herumi/fmath/blob/master/fmath.hpp#L480
>>> Hey, that's cool.  I was a bit disappointed not finding this sort of
>>> work in open space.  It seems that this lacks threading support, but
>>> that should be easy to implement by using OpenMP directives.
>> IMO this is the wrong place to introduce threading; each thread should
>> call expd_v on its chunks. (Which I think is how you said numexpr
>> currently uses VML anyway.)
>
> Oh sure, but then you need a blocked engine for performing the
> computations too.  And yes, by default numexpr uses its own threading

I just meant that you can use a chunked OpenMP for-loop wherever your 
code calls expd_v. A "five-line blocked engine", if you like :-)

IMO that's the right location since entering/exiting OpenMP blocks takes 
some time.
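
A minimal sketch of such a chunked loop, under some assumptions:
exp_block is a hypothetical stand-in for a serial vectorized kernel such
as expd_v (assumed here to overwrite its input in place), and both
exp_chunked and the default chunk length of 4096 elements are
illustrative choices, not taken from fmath or numexpr. The parallel
region is entered once for the whole array, so its entry/exit cost is
paid once rather than once per chunk.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a serial vectorized kernel such as expd_v:
// overwrites p[0..n) with exp(p[i]). Swap in the real kernel here.
inline void exp_block(double* p, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) p[i] = std::exp(p[i]);
}

// Chunked loop: one parallel region for the whole array; every iteration
// hands one contiguous chunk to the serial kernel.
inline void exp_chunked(std::vector<double>& x, std::size_t chunk = 4096) {
    assert(chunk > 0);
    const std::size_t n = x.size();
    // Signed loop counter, since older OpenMP versions require one.
    const long long nchunks = static_cast<long long>((n + chunk - 1) / chunk);
    #pragma omp parallel for schedule(static)
    for (long long c = 0; c < nchunks; ++c) {
        const std::size_t start = static_cast<std::size_t>(c) * chunk;
        exp_block(x.data() + start, std::min(chunk, n - start));
    }
}
```

Built without OpenMP support the pragma is ignored and the loop simply
runs serially, which is handy for checking results.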

> code rather than the existing one in VML (but that can be changed by
> playing with set_num_threads/set_vml_num_threads).  It always struck
> me as a little strange that the internal threading in numexpr was more
> efficient than the VML one, but I suppose this is because the latter is
> optimized for large blocks rather than the medium-sized ones (4K) used
> in numexpr.

I don't know enough about numexpr to understand this :-)

I guess I just don't see the motivation for using VML threading, or why 
it should be faster. If you pass a single 4K block to a threaded VML 
call, I could easily see lots of performance problems:

a) starting/stopping threads, or signalling the threads of a pool, is a 
   constant overhead per "parallel section";

b) unless you're very careful to only have VML touch the data, and VML 
   always schedules elements in the exact same way, you're going to have 
   the cache lines of that 4K block shuffled between the L1 caches of 
   different cores for different operations...
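
To illustrate the cache point, here is a rough sketch of the
alternative: the outer loop is threaded, and each thread runs every
operation on its own block before moving on, so a block is only ever
touched by one core. All names here (exp_block, scale_block,
scaled_exp) are hypothetical stand-ins for serial vector-math kernels,
and the block length of 4096 elements is an arbitrary assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical serial kernels standing in for single-threaded vector-math calls.
inline void exp_block(double* p, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) p[i] = std::exp(p[i]);
}
inline void scale_block(double* p, std::size_t n, double a) {
    for (std::size_t i = 0; i < n; ++i) p[i] *= a;
}

// Computes x[i] = a * exp(x[i]) with threading at the block level.
inline void scaled_exp(std::vector<double>& x, double a, std::size_t block = 4096) {
    assert(block > 0);
    const std::size_t n = x.size();
    const long long nblocks = static_cast<long long>((n + block - 1) / block);
    // One parallel region covers both operations.
    #pragma omp parallel for schedule(static)
    for (long long b = 0; b < nblocks; ++b) {
        const std::size_t start = static_cast<std::size_t>(b) * block;
        const std::size_t len = std::min(block, n - start);
        exp_block(x.data() + start, len);       // first pass over the block on this thread
        scale_block(x.data() + start, len, a);  // second pass, same thread, same block
    }
}
```

With two separately threaded calls on the same block instead, there
would be two parallel sections, and nothing would guarantee that a
given element is handled by the same core both times.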

As I said, I'm mostly ignorant about how numexpr works, that's probably 
showing :-)

Dag Sverre

