[Numpy-discussion] [OT] Starving CPUs article featured in IEEE's ComputingNow portal
Sat Mar 20 15:18:02 CDT 2010
On Sat, Mar 20, 2010 at 8:22 PM, Anne Archibald
> On 20 March 2010 14:56, Dag Sverre Seljebotn
> <firstname.lastname@example.org> wrote:
>> Pauli Virtanen wrote:
>>> Anne Archibald wrote:
>>>> I'm not knocking numpy; it does (almost) the best it can. (I'm not
>>>> sure of the optimality of the order in which ufuncs are executed; I
>>>> think some optimizations there are possible.)
>>> Ufuncs and reductions are not performed in a cache-optimal fashion, IIRC
>>> dimensions are always traversed in order from left to right. Large
>>> speedups are possible in some cases, but in a quick try I didn't manage to
>>> come up with an algorithm that would always improve the speed (there was a
>>> thread about this last year or so, and there's a ticket). Things varied
>>> between computers, so this probably depends a lot on the actual cache
>>> But perhaps numexpr has such heuristics, and we could steal them?
>> At least in MultiIter (and I always assumed ufuncs too, but perhaps not)
>> there's functionality to remove the largest dimension so that it can be
>> put innermost in a loop. In many situations, removing the dimension with
>> the smallest stride from the iterator would probably work much better.
>> It's all about balancing iterator overhead and memory overhead. Something
>> simple like "select the dimension with length > 200 which has smallest
>> stride, or the dimension with largest length if none are above 200" would
>> perhaps work well?
> There's more to it than that: there's no point (I think) going with
> the smallest stride if that stride is more than a cache line (usually
> 64 bytes). There are further optimizations - often
> stride[i]*shape[i]==shape[j] for some (i,j), and in that case these
> two dimensions can be treated as one with stride stride[i] and length
> shape[i]*shape[j]. Applying this would "unroll" all C- or
> Fortran-contiguous arrays to a single non-nested loop.
> Of course, reordering ufuncs is potentially hazardous in terms of
> breaking user code: the effect of
> depends on whether you run through A in index order or reverse index
> order (which might be the order it's stored in memory, potentially
> faster from a cache point of view).
I think Travis O suggested at some point to make this kind of
"overlapping inplace" operations invalid.
(maybe it was someone else ...)
The idea was to protect new ("uninitiated") users from unexpected results.
Anyway, maybe one could use (something like) np.seterror to determine
how to handle these cases...
More information about the NumPy-Discussion