[Numpy-discussion] Optimizing reduction loops (sum(), prod(), et al.)
Charles R Harris
charlesr.harris@gmail....
Wed Jul 8 17:23:03 CDT 2009
On Wed, Jul 8, 2009 at 4:16 PM, Pauli Virtanen <pav+sp@iki.fi> wrote:
> Hi,
>
> Ticket #1143 points out that Numpy's reduction operations are not
> always cache friendly. I worked a bit on tuning them.
>
>
> Just to tickle some interest, a "pathological" case before optimization:
>
> In [1]: import numpy as np
> In [2]: x = np.zeros((80000, 256))
> In [3]: %timeit x.sum(axis=0)
> 10 loops, best of 3: 850 ms per loop
>
> After optimization:
>
> In [1]: import numpy as np
> In [2]: x = np.zeros((80000, 256))
> In [3]: %timeit x.sum(axis=0)
> 10 loops, best of 3: 78.5 ms per loop
>
> For comparison, a reduction operation on a contiguous array of
> the same size:
>
> In [4]: x = np.zeros((256, 80000))
> In [5]: %timeit x.sum(axis=1)
> 10 loops, best of 3: 88.9 ms per loop
>
;)
>
> Funnily enough, it's actually slower than the reduction over the
> axis with the larger stride. The improvement factor depends on
> the CPU and its cache size.
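
The cache effect Pauli describes can be sketched in plain NumPy: reducing over
axis 0 one output element at a time makes a long strided pass per column, while
streaming through the rows touches memory contiguously and accumulates into a
small buffer that stays in cache. A minimal illustration of the idea, not the
actual patch from ticket #1143; the array is scaled down from the post's example:

```python
import numpy as np

x = np.random.rand(8000, 256)  # C-contiguous, smaller than the 80000x256 case

# Column-at-a-time: one strided pass over 8000 elements per output slot.
# Each step jumps 256 * 8 = 2048 bytes, so most accesses miss the cache.
out_cols = np.array([x[:, j].sum() for j in range(x.shape[1])])

# Row-at-a-time: a single contiguous sweep over the data, accumulating
# into a 256-element output buffer that stays resident in cache.
out_rows = np.zeros(x.shape[1])
for row in x:
    out_rows += row

# Both orderings compute the same axis-0 reduction.
assert np.allclose(out_cols, out_rows)
assert np.allclose(out_rows, x.sum(axis=0))
```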
>
>
How do the benchmarks compare with just making contiguous copies? Which is a
sort of blocking, I suppose.
Chuck
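
The contiguous-copy approach Chuck asks about can be sketched like this (an
illustration of the idea, not a benchmark from the thread; the copy turns the
reduction axis into the fast axis at the cost of one extra pass over the data):

```python
import numpy as np

x = np.random.rand(8000, 256)  # scaled-down version of the post's array

# Copy the transpose into C order so the axis being reduced becomes
# contiguous, then reduce along it with a cache-friendly inner loop.
xt = np.ascontiguousarray(x.T)
s = xt.sum(axis=1)

assert np.allclose(s, x.sum(axis=0))
```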