[Numpy-discussion] poor performance of sum with sub-machine-word integer types

Charles R Harris charlesr.harris@gmail....
Tue Jun 21 12:25:40 CDT 2011


On Tue, Jun 21, 2011 at 11:17 AM, Keith Goodman <kwgoodman@gmail.com> wrote:

> On Tue, Jun 21, 2011 at 9:46 AM, Zachary Pincus <zachary.pincus@yale.edu>
> wrote:
> > Hello all,
> >
> > As a result of the "fast greyscale conversion" thread, I noticed an
> > anomaly with numpy.ndarray.sum(): summing along certain axes is much
> > slower with sum() than doing it explicitly, but only with integer
> > dtypes whose size is less than the machine word. I checked in 32-bit
> > and 64-bit modes, and in both cases the speed difference went away
> > only once the dtype was as large as the machine word. See below...
> >
> > Is this something to do with numpy or something inexorable about
> > machine / memory architecture?
> >
> > Zach
> >
> > Timings -- 64-bit mode:
> > ----------------------
> > In [2]: i = numpy.ones((1024,1024,4), numpy.int8)
> > In [3]: timeit i.sum(axis=-1)
> > 10 loops, best of 3: 131 ms per loop
> > In [4]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> > 100 loops, best of 3: 2.57 ms per loop
> >
> > In [5]: i = numpy.ones((1024,1024,4), numpy.int16)
> > In [6]: timeit i.sum(axis=-1)
> > 10 loops, best of 3: 131 ms per loop
> > In [7]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> > 100 loops, best of 3: 4.75 ms per loop
> >
> > In [8]: i = numpy.ones((1024,1024,4), numpy.int32)
> > In [9]: timeit i.sum(axis=-1)
> > 10 loops, best of 3: 131 ms per loop
> > In [10]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> > 100 loops, best of 3: 6.37 ms per loop
> >
> > In [11]: i = numpy.ones((1024,1024,4), numpy.int64)
> > In [12]: timeit i.sum(axis=-1)
> > 100 loops, best of 3: 16.6 ms per loop
> > In [13]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> > 100 loops, best of 3: 15.1 ms per loop
> >
> >
> >
> > Timings -- 32-bit mode:
> > ----------------------
> > In [2]: i = numpy.ones((1024,1024,4), numpy.int8)
> > In [3]: timeit i.sum(axis=-1)
> > 10 loops, best of 3: 138 ms per loop
> > In [4]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> > 100 loops, best of 3: 3.68 ms per loop
> >
> > In [5]: i = numpy.ones((1024,1024,4), numpy.int16)
> > In [6]: timeit i.sum(axis=-1)
> > 10 loops, best of 3: 140 ms per loop
> > In [7]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> > 100 loops, best of 3: 4.17 ms per loop
> >
> > In [8]: i = numpy.ones((1024,1024,4), numpy.int32)
> > In [9]: timeit i.sum(axis=-1)
> > 10 loops, best of 3: 22.4 ms per loop
> > In [10]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> > 100 loops, best of 3: 12.2 ms per loop
> >
> > In [11]: i = numpy.ones((1024,1024,4), numpy.int64)
> > In [12]: timeit i.sum(axis=-1)
> > 10 loops, best of 3: 29.2 ms per loop
> > In [13]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> > 10 loops, best of 3: 23.8 ms per loop
>
> One difference is that i.sum() changes the output dtype of int input
> when the int dtype is less than the default int dtype:
>
>    >> i.dtype
>    dtype('int32')
>    >> i.sum(axis=-1).dtype
>    dtype('int64')  # <-- dtype changed
>    >> (i[...,0]+i[...,1]+i[...,2]+i[...,3]).dtype
>    dtype('int32')
>
> Here are my timings
>
>    >> i = numpy.ones((1024,1024,4), numpy.int32)
>    >> timeit i.sum(axis=-1)
>    1 loops, best of 3: 278 ms per loop
>    >> timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
>    100 loops, best of 3: 12.1 ms per loop
>    >> import bottleneck as bn
>    >> timeit bn.func.nansum_3d_int32_axis2(i)
>    100 loops, best of 3: 8.27 ms per loop
>
> Does making an extra copy of the input explain all of the speed
> difference (is this what np.sum does internally?):
>
>    >> timeit i.astype(numpy.int64)
>     10 loops, best of 3: 29.2 ms per loop
>
> No.
>
>
I think you can see the overhead here:

In [14]: timeit einsum('ijk->ij', i, dtype=int32)
100 loops, best of 3: 17.6 ms per loop

In [15]: timeit einsum('ijk->ij', i, dtype=int64)
100 loops, best of 3: 18 ms per loop

In [16]: timeit einsum('ijk->ij', i, dtype=int16)
100 loops, best of 3: 18.3 ms per loop

In [17]: timeit einsum('ijk->ij', i, dtype=int8)
100 loops, best of 3: 5.87 ms per loop
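
For anyone following along, here is a minimal sketch of the dtype behaviour discussed above. It assumes a small int8 array (the shape is chosen only for illustration) and shows that sum() accumulates narrow integer input in the platform integer unless a dtype is passed explicitly, while the slice addition and einsum with an explicit dtype keep the narrow type. It is not a statement about what sum() does internally, only about the result dtypes.

```python
import numpy as np

# Small illustrative array; the thread used (1024, 1024, 4).
i = np.ones((64, 64, 4), dtype=np.int8)

# Default: integer input narrower than the platform integer is
# accumulated (and returned) as the platform integer, np.int_.
default_sum = i.sum(axis=-1)

# Passing dtype explicitly keeps the accumulator at the input width.
same_dtype_sum = i.sum(axis=-1, dtype=np.int8)

# Adding the slices by hand never leaves the input dtype.
explicit = i[..., 0] + i[..., 1] + i[..., 2] + i[..., 3]

# einsum also accepts an explicit dtype for the computation.
einsum_sum = np.einsum('ijk->ij', i, dtype=np.int8)

print(default_sum.dtype, same_dtype_sum.dtype,
      explicit.dtype, einsum_sum.dtype)
```

Note that keeping a narrow accumulator trades safety for speed: with int8 the sum wraps around once it exceeds 127, which is presumably why the wider default exists.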


> Initializing the output also adds some time:
>
>    >> timeit np.empty((1024,1024,4), dtype=np.int32)
>    100000 loops, best of 3: 2.67 us per loop
>    >> timeit np.empty((1024,1024,4), dtype=np.int64)
>    100000 loops, best of 3: 12.8 us per loop
>
> Switching back and forth between the input and output array takes more
> "memory" time too with int64 arrays compared to int32.
>
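
To rerun the comparisons in this thread outside IPython, something like the following sketch may help. It uses the standard timeit module rather than the timeit magic; the array is deliberately smaller than the one in the thread so it finishes quickly, and the labels and repeat counts are arbitrary choices, not anything from the original posts. Absolute numbers will differ by machine and NumPy version.

```python
import timeit
import numpy as np

# Smaller than the (1024, 1024, 4) arrays in the thread so this runs quickly.
i = np.ones((256, 256, 4), dtype=np.int32)

# The operations compared in the thread, keyed by an illustrative label.
candidates = {
    "sum, default accumulator": lambda: i.sum(axis=-1),
    "sum, dtype=int32": lambda: i.sum(axis=-1, dtype=np.int32),
    "explicit slices": lambda: i[..., 0] + i[..., 1] + i[..., 2] + i[..., 3],
    "astype(int64) copy": lambda: i.astype(np.int64),
}

timings = {}
for label, func in candidates.items():
    # Best of 3 repeats of 10 calls each, converted to ms per call.
    best = min(timeit.repeat(func, number=10, repeat=3))
    timings[label] = best / 10 * 1e3

for label, ms in timings.items():
    print(f"{label:26s} {ms:8.3f} ms per call")
```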

Chuck