[Numpy-discussion] Optimizing reduction loops (sum(), prod(), et al.)

Pauli Virtanen pav+sp@iki...
Wed Jul 8 17:16:22 CDT 2009


Hi,

Ticket #1143 points out that Numpy's reduction operations are not
always cache friendly. I worked a bit on tuning them.


Just to tickle some interest, a "pathological" case before optimization:

    In [1]: import numpy as np
    In [2]: x = np.zeros((80000, 256))
    In [3]: %timeit x.sum(axis=0)
    10 loops, best of 3: 850 ms per loop

After optimization:

    In [1]: import numpy as np
    In [2]: x = np.zeros((80000, 256))
    In [3]: %timeit x.sum(axis=0)
    10 loops, best of 3: 78.5 ms per loop

For comparison, a reduction operation on a contiguous array of 
the same size:

    In [4]: x = np.zeros((256, 80000))
    In [5]: %timeit x.sum(axis=1)
    10 loops, best of 3: 88.9 ms per loop

Funnily enough, the contiguous reduction is actually slower than
the optimized reduction over the axis with the larger stride. The
improvement factor depends on the CPU and its cache size.
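
For the curious, the memory layout behind the slow case is easy
to inspect from Python (a minimal illustration, not part of the
patch; the numbers assume a default C-contiguous float64 array):

    import numpy as np

    x = np.zeros((80000, 256))
    # Reducing over axis=0 walks down each column: consecutive
    # elements are 256*8 = 2048 bytes apart, so in the naive loop
    # each cache line fetched contributes only one useful element.
    print(x.strides)    # (2048, 8)
    print(x.T.strides)  # (8, 2048), same memory viewed transposed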


The code is here:

    http://github.com/pv/numpy-work/ticket/1143-speedup-reduce

A benchmark script numpy/core/src/bench-reduce.py is included. Just do

    ./bench-reduce.py ../../../build/lib.linux-i686-2.6 | tee output.txt

to run it against the system Numpy, and

    ./bench-reduce.py plot < output.txt

to plot the results. Here's one result set:

    http://www.iki.fi/pav/tmp/xeon.png
    http://www.iki.fi/pav/tmp/xeon.txt

The y-axis is proportional to the number of elemental (addition) 
operations per second, x-axis the total size of the array. 
Vertical lines show how the result changed by the cache 
optimization. Factor of 2 improvement seems common in the test 
cases (typically, arrays that are "skinny" in some dimension). 
There are also some factors of 10, especially for arrays larger 
than the 6144 kB cache of this particular CPU.  Also, for this 
CPU, there are no significant regressions.


On an older CPU (slower, smaller cache), the situation is 
slightly different:

    http://www.iki.fi/pav/tmp/athlon.png
    http://www.iki.fi/pav/tmp/athlon.txt

On average, it's still an improvement in many cases.  However, 
now there are more regressions. The significant ones (factor of 
1/2) are N-D arrays where the reduction runs over an axis with a 
small number of elements.


The optimization is in evaluating the reduction operation with a 
blocked loop (pseudocode)

    for block_idx in loop_over_blocks_except_axis(arr):
        result[block_idx] = arr[block_idx, 0]
        for j in xrange(1, N):
            result[block_idx] += arr[block_idx, j]   } ufunc loop

instead of the usual ufunc reduction loop

    for idx in loop_over_array_except_axis(arr):
        result[idx] = arr[idx, 0]            }
        for j in xrange(1, N):               } ufunc loop
            result[idx] += arr[idx, j]       }

The blocks are chosen to minimize the stride of the innermost
ufunc loop, maximizing reuse of elements that fit in a single
cache line. This has to be balanced against avoiding inner loops
over very small dimensions, which incur some overhead. For N-D
arrays, I also attempt to lump several dimensions into the ufunc
loop, if the data can be iterated over with a single stride.
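
To make the blocking idea concrete outside the C code, here is a
rough pure-Python sketch of the same strategy for the 2-D, axis=0
case. The function name and the block_size value are made up for
illustration; the real implementation lives in the ufunc reduction
machinery and picks blocks from the strides and dimension sizes:

    import numpy as np

    def blocked_sum_axis0(arr, block_size=256):
        # Sum a 2-D array over axis 0, sweeping the rows while the
        # running result for a narrow block of columns stays in cache.
        n_rows, n_cols = arr.shape
        result = np.zeros(n_cols, dtype=arr.dtype)
        for start in range(0, n_cols, block_size):
            stop = min(start + block_size, n_cols)
            acc = result[start:stop]          # view into the output
            for i in range(n_rows):
                # unit-stride inner loop over contiguous row data
                acc += arr[i, start:stop]
        return result

    x = np.zeros((80000, 256))
    assert np.allclose(blocked_sum_axis0(x), x.sum(axis=0))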


The decision heuristics are in ufunc_object.c:_construct_reduce, 
and the new outer loop NOBUFFER_UFUNCREDUCE_TRANSPOSE in 
ufunc_object.c:PyUFunc_Reduce.


Unfortunately, improving the performance using the above scheme 
comes at the cost of some slightly murky heuristics.  I didn't 
manage to come up with an optimal decision rule, so they are 
partly empirical. There is one parameter tuning the cross-over 
between minimizing stride and avoiding small dimensions. (This is 
more or less straightforward.)  Another empirical decision is 
required in choosing whether to use the usual reduction loop, 
which is better in some cases, or the blocked loop. How to make 
this latter choice is not so clear to me.
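
Just to illustrate the kind of trade-off involved, a toy version
of the first rule might look like the following (purely
hypothetical: the threshold and the function are mine, not the
rule actually used in the patch):

    def choose_inner_axis(shape, strides, reduce_axis, min_inner_len=8):
        # Toy heuristic: run the inner ufunc loop over the non-reduced
        # axis with the smallest stride, but skip axes shorter than
        # min_inner_len, since very short inner loops cost more in
        # loop overhead than they gain in cache-line reuse.
        candidates = [(abs(strides[ax]), ax)
                      for ax in range(len(shape))
                      if ax != reduce_axis and shape[ax] >= min_inner_len]
        if not candidates:
            return None   # fall back to the usual reduction loop
        return min(candidates)[1]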

The present approach seems to mostly work: the benchmark shows
improvements and no severe regressions. However, I think the
regressions on the older CPU indicate that something still needs
to be tuned. The main suspect is the latter decision rule, but
I'm not sure what the correct solution is here.


Comments?


I'm not a cache optimization expert, so bright ideas about better 
decision rules and further optimizations would be appreciated.

Also, I think this addition would be nice to have included in
Numpy. The main value, IMHO, is in boosting the most
cache-inefficient corner cases. I think many users do not realize
that

    np.zeros((256, 80000)).sum(axis=1)

is currently more efficient than

    np.zeros((80000, 256)).sum(axis=0).

And as shown above, there is no real reason to have this 
performance difference...


Cheers,
Pauli


