[Numpy-discussion] Optimizing reduction loops (sum(), prod(), et al.)
Wed Jul 8 17:23:03 CDT 2009
> Ticket #1143 points out that Numpy's reduction operations are not
> always cache friendly. I worked a bit on tuning them.
>
> Just to tickle some interest, a "pathological" case before optimization:
>
> In [1]: import numpy as np
> In [2]: x = np.zeros((80000, 256))
> In [3]: %timeit x.sum(axis=0)
> 10 loops, best of 3: 850 ms per loop
>
> After optimization:
>
> In [1]: import numpy as np
> In [2]: x = np.zeros((80000, 256))
> In [3]: %timeit x.sum(axis=0)
> 10 loops, best of 3: 78.5 ms per loop
>
> For comparison, a reduction operation on a contiguous array of
> the same size:
>
> In [4]: x = np.zeros((256, 80000))
> In [5]: %timeit x.sum(axis=1)
> 10 loops, best of 3: 88.9 ms per loop
>
> Funnily enough, it's actually slower than the reduction over the
> axis with the larger stride. The improvement factor depends on
> the CPU and its cache size.
How do the benchmarks compare with just making contiguous copies? Which is
blocking of sorts, I suppose.
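
To make the question concrete, here is a rough sketch of the comparison I
have in mind. The helper names (sum_strided, sum_via_copy, compare) are made
up for illustration; they are not part of Numpy and this is not the patch
from the ticket. It only uses ndarray.sum, np.ascontiguousarray and the
standard timeit module.

```python
import timeit

import numpy as np


def sum_strided(x):
    # Reduce directly over axis 0, i.e. over the axis with the larger stride
    # for a C-ordered 2-d array.
    return x.sum(axis=0)


def sum_via_copy(x):
    # Copy the transpose into a C-contiguous buffer first, so that every
    # reduction runs over adjacent memory, then reduce over the last axis.
    xt = np.ascontiguousarray(x.T)
    return xt.sum(axis=1)


def compare(shape=(80000, 256), repeat=3, number=10):
    # Return the best per-call time in seconds for (direct, copy-then-sum).
    x = np.zeros(shape)
    t_direct = min(timeit.repeat(lambda: sum_strided(x),
                                 repeat=repeat, number=number)) / number
    t_copy = min(timeit.repeat(lambda: sum_via_copy(x),
                               repeat=repeat, number=number)) / number
    return t_direct, t_copy
```

Calling compare() with the default shape reproduces the array from the
quoted example. Note that the copy time is included in the second figure,
and that the copy needs a second buffer of the same size as x.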
Chuck
