[Numpy-discussion] Optimizing reduction loops (sum(), prod(), et al.)
Pauli Virtanen
pav+sp@iki...
Wed Jul 8 17:16:22 CDT 2009
Hi,
Ticket #1143 points out that Numpy's reduction operations are not
always cache friendly. I worked a bit on tuning them.
Just to tickle some interest, a "pathological" case before optimization:
In [1]: import numpy as np
In [2]: x = np.zeros((80000, 256))
In [3]: %timeit x.sum(axis=0)
10 loops, best of 3: 850 ms per loop
After optimization:
In [1]: import numpy as np
In [2]: x = np.zeros((80000, 256))
In [3]: %timeit x.sum(axis=0)
10 loops, best of 3: 78.5 ms per loop
For comparison, a reduction operation on a contiguous array of
the same size:
In [4]: x = np.zeros((256, 80000))
In [5]: %timeit x.sum(axis=1)
10 loops, best of 3: 88.9 ms per loop
Funnily enough, it's actually slower than the reduction over the
axis with the larger stride. The improvement factor depends on
the CPU and its cache size.
The code is here:
http://github.com/pv/numpy-work/ticket/1143-speedup-reduce
A benchmark script numpy/core/src/bench-reduce.py is included. Just do
./bench-reduce.py ../../../build/lib.linux-i686-2.6 | tee output.txt
to run it against the system Numpy, and
./bench-reduce.py plot < output.txt
to plot the results. Here's one result set:
http://www.iki.fi/pav/tmp/xeon.png
http://www.iki.fi/pav/tmp/xeon.txt
The y-axis is proportional to the number of elemental (addition)
operations per second, x-axis the total size of the array.
Vertical lines show how the result changed by the cache
optimization. Factor of 2 improvement seems common in the test
cases (typically, arrays that are "skinny" in some dimension).
There are also some factors of 10, especially for arrays larger
than the 6144 kb cache of this particular CPU. Also, for this
CPU, there are no significant regressions.
On an older CPU (slower, smaller cache), the situation is
slightly different:
http://www.iki.fi/pav/tmp/athlon.png
http://www.iki.fi/pav/tmp/athlon.txt
On average, it's still an improvement in many cases. However,
now there are more regressions. The significant ones (factor of
1/2) are N-D arrays where the reduction runs over an axis with a
small number of elements.
The optimization is in evaluating the reduction operation with a
blocked loop (pseudocode)
for block_idx in loop_over_blocks_except_axis(arr):
result[block_idx] = arr[block_idx,0]
for j in xrange(1, N):
result[block_idx] += arr[block_idx,j] } ufunc loop
instead of the usual ufunc reduction loop
for idx in loop_over_array_except_axis(arr):
result[idx] = arr[idx,0] }
for j in xrange(1, N): } ufunc loop
result[idx] += arr[idx,j] }
The blocks are chosen to minimize the stride of the inmost ufunc
loop, maximizing reuse of elements fitting in a single cache
line. This has to be balanced against avoiding inner loops over
very small dimensions, which incur some overhead. For N-D arrays,
I also attempt to lump several dimensions into the ufunc loop, if
the data can be iterated over with a single stride.
The decision heuristics are in ufunc_object.c:_construct_reduce,
and the new outer loop NOBUFFER_UFUNCREDUCE_TRANSPOSE in
ufunc_object.c:PyUFunc_Reduce.
Unfortunately, improving the performance using the above scheme
comes at the cost of some slightly murky heuristics. I didn't
manage to come up with an optimal decision rule, so they are
partly empirical. There is one parameter tuning the cross-over
between minimizing stride and avoiding small dimensions. (This is
more or less straightforward.) Another empirical decision is
required in choosing whether to use the usual reduction loop,
which is better in some cases, or the blocked loop. How to make
this latter choice is not so clear to me.
The present approach seems to mostly work, as far as the
benchmark produces improvements and produces no severe
regressions. However, I think the regressions for the older CPU
point out that something still needs to be tuned. The main
suspect is the latter decision rule, but I'm not sure what the
correct solution here is.
Comments?
I'm not a cache optimization expert, so bright ideas about better
decision rules and further optimizations would be appreciated.
Also, I think this addition could be nice to have included in
Numpy. The main value IMHO is in boosting the most
cache-inefficient corner cases. I think many users do not realize
that
np.zeros((256, 80000)).sum(axis=1)
is currently more efficient than
np.zeros((80000, 256)).sum(axis=0).
And as shown above, there is no real reason to have this
performance difference...
Cheers,
Pauli
More information about the NumPy-Discussion
mailing list