# [Numpy-discussion] Ufunc memory access optimization

Pauli Virtanen pav@iki...
Tue Jun 15 13:15:39 CDT 2010

On Tue, 2010-06-15 at 11:35 -0400, Anne Archibald wrote:
[clip: Numpy ufunc inner loops are slow]
> This is a bit strange. I think I'd still vote for including this
> optimization, since one hopes the inner loop will get faster at some
> point. (If nothing else, the code generator can probably be made to
> generate specialized loops for common cases).

Ok, apparently I couldn't read: Numpy already unrolls the ufunc loop for
C-contiguous arrays, which is why there was no difference.

This should be fairer:

x = np.zeros((2,)*20).transpose([1,0]+list(range(2,20)))
y = x.flatten()
%timeit x+x
10 loops, best of 3: 122 ms per loop
%timeit y+y
10 loops, best of 3: 19.9 ms per loop

And with the loop order optimization:

%timeit x+x
10 loops, best of 3: 20 ms per loop

To get an idea of how things look when the loops cannot be collapsed,
one can engineer nastier arrays:

x = np.zeros([50*5,50*3,50*2])[:-23:5, :-29:3, 31::2]
%timeit x+x
100 loops, best of 3: 2.83 ms (3.59 ms) per loop

[clip: minimize sum of strides]
> I suspect that it isn't optimal. As I understand it, the key
> restriction imposed by the cache architecture is that an entire cache
> line - 64 bytes or so - must be loaded at once. For strides that are
> larger than 64 bytes I suspect that it simply doesn't matter how big
> they are. (There may be some subtle issues with cache associativity
> but this would be extremely architecture-dependent.) So I would say,
> first do any dimensions in which some or all strides are less than 64
> bytes, starting from the smallest. After that, any order you like is
> fine.

This is all up to designing a suitable objective function. Suppose
s_{i,j} is the stride of array i in dimension j. The per-dimension sums

f_j = \sum_i |s_{i,j}|

are computed, and an axis permutation which makes the sequence f_j
decreasing is chosen. Other heuristics would use a different f_j.
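As a minimal sketch of this heuristic (loop_order is a hypothetical helper name, not a Numpy function):

```python
import numpy as np

# Stride-sum heuristic: for each dimension j, compute
#   f_j = sum_i |s_{i,j}|
# over the operand arrays, then permute the axes so that f_j is
# decreasing -- the smallest-stride axis becomes the inner loop.
def loop_order(*arrays):
    ndim = arrays[0].ndim
    f = [sum(abs(a.strides[j]) for a in arrays) for j in range(ndim)]
    return sorted(range(ndim), key=lambda j: -f[j])

x = np.zeros((4, 5, 6))                  # C order: strides decrease left to right
print(loop_order(x, x))                  # [0, 1, 2]: already optimal
y = x.transpose(1, 0, 2)                 # swap the two outer axes
print(loop_order(y, y))                  # [1, 0, 2]: undoes the transpose
```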

One metric that I used previously when optimizing the reductions was
the number of elements that fit in a cache line, as you suggested. The
sum of this over all arrays should be maximized in the inner loops. So
one could take the objective function

f_j = - C_1 C_2 \sum_i CACHELINE / (1 + C_2 |s_{i,j}|)

with some constants C_1 and C_2, which, perhaps, would give better
results. Note that the stride can be zero for broadcasted arrays, so we
need to add a cutoff for that. With a bias toward C order, one could
then have

f_j = - C_3 j - C_1 C_2 \sum_i CACHELINE / (1 + C_2 |s_{i,j}|)

Now the question would be just choosing C_1, C_2, and C_3 suitably.
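A sketch of this objective, with placeholder values for the undetermined constants (CACHELINE, C_1, C_2, C_3 are assumptions, not tuned values):

```python
import numpy as np

# Cache-line objective from above:
#   f_j = -C3*j - C1*C2 * sum_i CACHELINE / (1 + C2*|s_{i,j}|)
# The "1 +" in the denominator is the cutoff that keeps the term finite
# when a broadcast operand has stride 0 in dimension j; the -C3*j term
# is the bias toward C order.
def f_cacheline(arrays, j, CACHELINE=64, C1=1.0, C2=1.0, C3=0.01):
    return -C3 * j - C1 * C2 * sum(
        CACHELINE / (1.0 + C2 * abs(a.strides[j])) for a in arrays)

x = np.zeros((100, 100))             # float64: strides (800, 8)
# The small-stride axis scores much more negative, so sorting f_j in
# decreasing order still places it innermost.
print(f_cacheline([x, x, x], 0))     # large stride: close to 0
print(f_cacheline([x, x, x], 1))     # small stride: strongly negative
```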

Another optimization could be flipping negative strides. This sounds a
bit dangerous, though, so one would need to think about whether it could
e.g. break
a += a[::-1]

etc.
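The danger here is aliasing: the output overlaps a reversed view of itself, so a loop that also reversed its iteration direction could read elements it has already written. A small illustration (current Numpy handles the overlap correctly; the broken result in the comment is what a naive forward in-place loop would produce, worked out by hand):

```python
import numpy as np

# a += a[::-1]: the output aliases a reversed view of itself.
# A naive in-place loop reading a[2], a[1], a[0] while writing
# a[0], a[1], a[2] would compute [4, 4, 7] -- the last element would
# see the already-updated a[0].
a = np.arange(1, 4)    # [1, 2, 3]
a += a[::-1]           # element-wise add with the reversed view
print(a)               # [4, 4, 4] when the overlap is handled correctly
```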

[clip]
> I'm more worried this may violate some users' assumptions. If a user
> knows they need an array to be in C order, really they should use
> ascontiguousarray. But as it stands it's enough to make sure that it's
> newly-allocated as the result of an arithmetic expression. Incautious
> users could suddenly start seeing copies happening, or even transposed
> arrays being passed to, C and Fortran code.

Yes, it's a semantic change, and we have to make it consciously (and
conscientiously, to boot :).

I believe f2py checks the contiguity of intent(inout) arguments?
The same should go for most C code that accepts Numpy arrays: at the
least, PyArray_IsContiguous should be checked before contiguity is
assumed. But this would still mean that code that previously worked
could start throwing exceptions.
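On the Python side, the defensive pattern is the same: check the flag and copy only when necessary. A sketch (as_c_buffer is a hypothetical helper name, not part of Numpy):

```python
import numpy as np

# Defensive contiguity handling before handing a buffer to C/Fortran
# code: verify the flag instead of assuming the layout, and copy only
# when the check fails.
def as_c_buffer(a):
    if not a.flags['C_CONTIGUOUS']:
        a = np.ascontiguousarray(a)   # makes a C-ordered copy
    return a

x = np.zeros((3, 4)).T               # Fortran-ordered view of a C array
y = as_c_buffer(x)
print(x.flags['C_CONTIGUOUS'], y.flags['C_CONTIGUOUS'])  # False True
```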

Personally, I believe this change is worth making, with suitable mention
in the release notes.

> It's also worth exploring common usage patterns to make sure that
> numpy still gravitates towards using C contiguous arrays for
> everything. I'm imagining a user who at some point adds a transposed
> array to a normal array, then does a long series of computations on
> the result. We want as many of those operations as possible to operate
> on contiguous arrays, but it's possible that an out-of-order array
> could propagate indefinitely, forcing all loops to be done with one
> array having large strides, and resulting in output that is still
> out-of-order.

I think, at present, non-C-contiguous arrays will propagate
indefinitely.

> Some preference for C contiguous output is worth adding.

Suggestions for better heuristics are accepted, just state it in the
form of an objective function :)

> It would also be valuable to build some kind of test harness to track
> the memory layout of the arrays generated in some "typical"
> calculations as well as the ordering of the loop used.
>
> More generally, this problem is exactly what ATLAS is for - finding
> cache-optimal orders for linear algebra. So we shouldn't expect it to
> be simple.

Yes, I do not think any of us is expecting it to be simple. I don't
think we can aim for the optimal solution, since it is ill-defined, but
only for one that is "good enough in practice".

--
Pauli Virtanen