# [Numpy-discussion] When memory access is a bottleneck

Keith Goodman kwgoodman@gmail....
Fri Feb 25 11:30:05 CST 2011

```A topic that often comes up on the list is that arr.sum(axis=-1) is
faster than arr.sum(axis=0). For C ordered arrays, moving along the
last axis moves the smallest amount in memory. And moving small
amounts in memory keeps the data in cache longer. Can I use that fact
to speed up calculations along axis=0?

As a test I used the nanmean function from the Bottleneck package. I
created a second version of the function that calculates the nanmean
of two columns at a time (code below). I figured that would be a more
efficient use of cache. It is faster for many array sizes but not for
all. Is there anything I can do about that?

Is the speed up really due to cache? The smallest arrays I tried
should fit entirely in cache, so perhaps unrolling the loop is helping
the compiler be more efficient? Yeah, so obviously, I don't understand
what is going on.

Faster:

>> a = np.random.rand(10,10)
>> timeit nanmean_2d_float64_axis0(a)
1000000 loops, best of 3: 1.24 us per loop
>> timeit nanmean_2d_float64_axis0_double(a)
1000000 loops, best of 3: 1.17 us per loop

>> a = np.random.rand(100,100)
>> timeit nanmean_2d_float64_axis0(a)
100000 loops, best of 3: 16 us per loop
>> timeit nanmean_2d_float64_axis0_double(a)
100000 loops, best of 3: 15.7 us per loop

>> a = np.random.rand(1000,1000)
>> timeit nanmean_2d_float64_axis0(a)
100 loops, best of 3: 8.57 ms per loop
>> timeit nanmean_2d_float64_axis0_double(a)
100 loops, best of 3: 5.02 ms per loop

>> a = np.random.rand(10000,100)
>> timeit nanmean_2d_float64_axis0(a)
100 loops, best of 3: 3.52 ms per loop
>> timeit nanmean_2d_float64_axis0_double(a)
100 loops, best of 3: 3.35 ms per loop

Slower:

>> a = np.random.rand(100,10000)
>> timeit nanmean_2d_float64_axis0(a)
100 loops, best of 3: 2.22 ms per loop
>> timeit nanmean_2d_float64_axis0_double(a)
100 loops, best of 3: 2.57 ms per loop

@cython.boundscheck(False)
@cython.wraparound(False)
def nanmean_2d_float64_axis0(np.ndarray[np.float64_t, ndim=2] a):
"Mean of 2d array with dtype=float64 along axis=0 ignoring NaNs."
cdef int count = 0
cdef np.float64_t asum = 0, ai
cdef Py_ssize_t i0, i1
cdef np.npy_intp *dim
dim = PyArray_DIMS(a)
cdef int n0 = dim[0]
cdef int n1 = dim[1]
cdef np.npy_intp *dims = [n1]
cdef np.ndarray[np.float64_t, ndim=1] y = PyArray_EMPTY(1, dims,
NPY_float64, 0)
for i1 in range(n1):
asum = 0
count = 0
for i0 in range(n0):
ai = a[i0, i1]
if ai == ai:
asum += ai
count += 1
if count > 0:
y[i1] = asum / count
else:
y[i1] = NAN
return y

@cython.boundscheck(False)
@cython.wraparound(False)
def nanmean_2d_float64_axis0_double(np.ndarray[np.float64_t, ndim=2] a):
"Mean of 2d array with dtype=float64 along axis=0 ignoring NaNs."
cdef int count = 0, count2 = 0
cdef np.float64_t asum = 0, asum2 = 0, ai
cdef Py_ssize_t i0, i1, i11
cdef np.npy_intp *dim
dim = PyArray_DIMS(a)
cdef int n0 = dim[0]
cdef int n1 = dim[1]
cdef np.npy_intp *dims = [n1]
cdef np.ndarray[np.float64_t, ndim=1] y = PyArray_EMPTY(1, dims,
NPY_float64, 0)
for i1 in range(0,n1,2):
asum = 0
count = 0
asum2 = 0
count2 = 0
i11 = i1 + 1
for i0 in range(n0):
ai = a[i0, i1]
if ai == ai:
asum += ai
count += 1
ai = a[i0, i11]
if ai == ai:
asum2 += ai
count2 += 1
if count > 0:
y[i1] = asum / count
else:
y[i1] = NAN
if count2 > 0:
y[i11] = asum / count
else:
y[i11] = NAN
return y
```