[Numpy-discussion] Ufunc memory access optimization

Pauli Virtanen pav@iki...
Tue Jun 15 10:16:11 CDT 2010

ti, 2010-06-15 kello 10:10 -0400, Anne Archibald kirjoitti:
> Correct me if I'm wrong, but this code still doesn't seem to make the
> optimization of flattening arrays as much as possible. The array you
> get out of np.zeros((100,100)) can be iterated over as an array of
> shape (10000,), which should yield very substantial speedups. Since
> most arrays one operates on are like this, there's potentially a large
> speedup here. (On the other hand, if this optimization is being done,
> then these tests are somewhat deceptive.)

It does perform this optimization, and unravels the loop as much as
possible. If all arrays are wholly contiguous, iterators are not even
used in the ufunc loop. Check the part after

        /* Determine how many of the trailing dimensions are contiguous

However, in practice it seems that this typically is not a significant
win -- I don't get speedups over the unoptimized numpy code even for


where you'd think that the iterator overhead could be important:

        x = np.zeros((2,)*20)
        %timeit np.cos(x)
        10 loops, best of 3: 89.9 ms (90.2 ms) per loop

This is actually consistent with the result

        x = np.zeros((2,)*20)
        y = x.flatten()
        %timeit x+x
        10 loops, best of 3: 20.9 ms per loop
        %timeit y+y
        10 loops, best of 3: 21 ms per loop

that you can probably verify with any Numpy installation.

This result may actually be because the Numpy ufunc inner loops are not
especially optimized -- they don't use fixed stride sizes etc., and are
so not much more efficient than using array iterators.

> On the other hand, it seems to me there's still some question about
> how to optimize execution order when the ufunc is dealing with two or
> more arrays with different memory layouts. In such a case, which one
> do you reorder in favour of? 

The axis permutation is chosen so that the sum (over different arrays)
of strides (for each dimension) is decreasing towards the inner loops.

I'm not sure, however, that this is the optimal choice. Things may
depend quite a bit on the cache architecture here.

> Is it acceptable to return
> freshly-allocated arrays that are not C-contiguous?

Yes, it returns non-C-contiguous arrays. The contiguity is determined so
that the ufunc loop itself can access the memory with a single stride.

This may cause some speed loss in some expressions. Perhaps there should
be a subtle bias towards C-order? But I'm not sure this is worth the

Pauli Virtanen

More information about the NumPy-Discussion mailing list