[Numpy-discussion] Ufunc memory access optimization

Anne Archibald aarchiba@physics.mcgill...
Tue Jun 15 10:35:07 CDT 2010

On 15 June 2010 11:16, Pauli Virtanen <pav@iki.fi> wrote:
> ti, 2010-06-15 kello 10:10 -0400, Anne Archibald kirjoitti:
>> Correct me if I'm wrong, but this code still doesn't seem to make the
>> optimization of flattening arrays as much as possible. The array you
>> get out of np.zeros((100,100)) can be iterated over as an array of
>> shape (10000,), which should yield very substantial speedups. Since
>> most arrays one operates on are like this, there's potentially a large
>> speedup here. (On the other hand, if this optimization is being done,
>> then these tests are somewhat deceptive.)
> It does perform this optimization, and unravels the loop as much as
> possible. If all arrays are wholly contiguous, iterators are not even
> used in the ufunc loop. Check the part after
>        /* Determine how many of the trailing dimensions are contiguous
>        */
> However, in practice it seems that this typically is not a significant
> win -- I don't get speedups over the unoptimized numpy code even for
> shapes
>        (2,)*20
> where you'd think that the iterator overhead could be important:
>        x = np.zeros((2,)*20)
>        %timeit np.cos(x)
>        10 loops, best of 3: 89.9 ms (90.2 ms) per loop
> This is actually consistent with the result
>        x = np.zeros((2,)*20)
>        y = x.flatten()
>        %timeit x+x
>        10 loops, best of 3: 20.9 ms per loop
>        %timeit y+y
>        10 loops, best of 3: 21 ms per loop
> that you can probably verify with any Numpy installation.
> This result may actually be because the Numpy ufunc inner loops are not
> especially optimized -- they don't use fixed stride sizes etc., and are
> so not much more efficient than using array iterators.

This is a bit strange. I think I'd still vote for including this
optimization, since one hopes the inner loop will get faster at some
point. (If nothing else, the code generator can probably be made to
generate specialized loops for common cases).

>> On the other hand, it seems to me there's still some question about
>> how to optimize execution order when the ufunc is dealing with two or
>> more arrays with different memory layouts. In such a case, which one
>> do you reorder in favour of?
> The axis permutation is chosen so that the sum (over different arrays)
> of strides (for each dimension) is decreasing towards the inner loops.
> I'm not sure, however, that this is the optimal choice. Things may
> depend quite a bit on the cache architecture here.

I suspect that it isn't optimal. As I understand it, the key
restriction imposed by the cache architecture is that an entire cache
line - 64 bytes or so - must be loaded at once. For strides that are
larger than 64 bytes I suspect that it simply doesn't matter how big
they are. (There may be some subtle issues with cache associativity
but this would be extremely architecture-dependent.) So I would say,
first do any dimensions in which some or all strides are less than 64
bytes, starting from the smallest. After that, any order you like is

>> Is it acceptable to return
>> freshly-allocated arrays that are not C-contiguous?
> Yes, it returns non-C-contiguous arrays. The contiguity is determined so
> that the ufunc loop itself can access the memory with a single stride.
> This may cause some speed loss in some expressions. Perhaps there should
> be a subtle bias towards C-order? But I'm not sure this is worth the
> bother.

I'm more worried this may violate some users' assumptions. If a user
knows they need an array to be in C order, really they should use
ascontiguousarray. But as it stands it's enough to make sure that it's
newly-allocated as the result of an arithmetic expression. Incautious
users could suddenly start seeing copies happening, or even transposed
arrays being passed to, C and Fortran code.

It's also worth exploring common usage patterns to make sure that
numpy still gravitates towards using C contiguous arrays for
everything. I'm imagining a user who at some point adds a transposed
array to a normal array, then does a long series of computations on
the result. We want as many of those operations as possible to operate
on contiguous arrays, but it's possible that an out-of-order array
could propagate indefinitely, forcing all loops to be done with one
array having large strides, and resulting in output that is stil
out-of-order. Some preference for C contiguous output is worth adding.

It would also be valuable to build some kind of test harness to track
the memory layout of the arrays generated in some "typical"
calculations as well as the ordering of the loop used.

More generally, this problem is exactly what ATLAS is for - finding
cache-optimal orders for linear algebra. So we shouldn't expect it to
be simple.


> --
> Pauli Virtanen
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion

More information about the NumPy-Discussion mailing list