[Numpy-discussion] poor performance of sum with sub-machine-word integer types
Zachary Pincus
zachary.pincus@yale....
Tue Jun 21 11:46:03 CDT 2011
Hello all,
As a result of the "fast greyscale conversion" thread, I noticed an anomaly with numpy.ndarray.sum(): summing along certain axes is much slower with sum() than doing the addition explicitly, but only for integer dtypes smaller than the machine word. I checked in both 32-bit and 64-bit modes, and in each case the speed difference goes away only once the dtype is as large as the machine word. See below...
Is this something specific to numpy, or something inherent in the machine / memory architecture?
Zach
Timings -- 64-bit mode:
----------------------
In [2]: i = numpy.ones((1024,1024,4), numpy.int8)
In [3]: timeit i.sum(axis=-1)
10 loops, best of 3: 131 ms per loop
In [4]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
100 loops, best of 3: 2.57 ms per loop
In [5]: i = numpy.ones((1024,1024,4), numpy.int16)
In [6]: timeit i.sum(axis=-1)
10 loops, best of 3: 131 ms per loop
In [7]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
100 loops, best of 3: 4.75 ms per loop
In [8]: i = numpy.ones((1024,1024,4), numpy.int32)
In [9]: timeit i.sum(axis=-1)
10 loops, best of 3: 131 ms per loop
In [10]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
100 loops, best of 3: 6.37 ms per loop
In [11]: i = numpy.ones((1024,1024,4), numpy.int64)
In [12]: timeit i.sum(axis=-1)
100 loops, best of 3: 16.6 ms per loop
In [13]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
100 loops, best of 3: 15.1 ms per loop
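One documented difference between the two expressions may be relevant to the timings above: for integer inputs narrower than the platform integer, sum() accumulates in (and returns) the platform integer type by default, while the explicit addition stays in the input dtype. Whether that conversion accounts for the slowdown is an assumption, not something these timings establish. A minimal sketch to check the result dtypes (the smaller array shape and the dtype=numpy.int8 variant are additions for illustration, not part of the original session):

```python
import numpy as np

# Smaller array than in the timings; only the dtypes matter here.
i = np.ones((256, 256, 4), np.int8)

# Default reduction: narrow integer inputs are accumulated in the
# platform integer (np.int_), so the result dtype is wider than int8.
reduced = i.sum(axis=-1)

# Explicit addition: int8 + int8 stays int8 (and can wrap on overflow,
# although four ones fit comfortably).
explicit = i[..., 0] + i[..., 1] + i[..., 2] + i[..., 3]

# Requesting the accumulator dtype explicitly keeps the reduction in int8.
reduced_int8 = i.sum(axis=-1, dtype=np.int8)

print(reduced.dtype, explicit.dtype, reduced_int8.dtype)
```

Timing i.sum(axis=-1, dtype=i.dtype) next to the two expressions above would be one way to see how much of the gap, if any, comes from the wider accumulator.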
Timings -- 32-bit mode:
----------------------
In [2]: i = numpy.ones((1024,1024,4), numpy.int8)
In [3]: timeit i.sum(axis=-1)
10 loops, best of 3: 138 ms per loop
In [4]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
100 loops, best of 3: 3.68 ms per loop
In [5]: i = numpy.ones((1024,1024,4), numpy.int16)
In [6]: timeit i.sum(axis=-1)
10 loops, best of 3: 140 ms per loop
In [7]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
100 loops, best of 3: 4.17 ms per loop
In [8]: i = numpy.ones((1024,1024,4), numpy.int32)
In [9]: timeit i.sum(axis=-1)
10 loops, best of 3: 22.4 ms per loop
In [10]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
100 loops, best of 3: 12.2 ms per loop
In [11]: i = numpy.ones((1024,1024,4), numpy.int64)
In [12]: timeit i.sum(axis=-1)
10 loops, best of 3: 29.2 ms per loop
In [13]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
10 loops, best of 3: 23.8 ms per loop
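
For anyone who wants to rerun these measurements outside IPython, here is a rough sketch of a standalone script using the standard timeit module. The function names, the repeat/number settings, and the third variant (sum() with dtype set to the input dtype) are my own additions and not part of the session above; absolute numbers will depend on the numpy version and the machine.

```python
import timeit

import numpy as np


def reduce_sum(a):
    # The slow case from the session: default reduction along the last axis.
    return a.sum(axis=-1)


def explicit_sum(a):
    # The fast case from the session: add the four slices by hand.
    return a[..., 0] + a[..., 1] + a[..., 2] + a[..., 3]


def reduce_sum_same_dtype(a):
    # Added variant (assumption): keep the accumulator in the input dtype.
    return a.sum(axis=-1, dtype=a.dtype)


def best_ms(func, a, repeat=3, number=5):
    # Best-of-`repeat` time per call, in milliseconds.
    times = timeit.repeat(lambda: func(a), repeat=repeat, number=number)
    return min(times) / number * 1e3


def run(shape=(1024, 1024, 4),
        dtypes=(np.int8, np.int16, np.int32, np.int64)):
    # One row per dtype: (name, sum ms, explicit ms, same-dtype sum ms).
    rows = []
    for dt in dtypes:
        a = np.ones(shape, dt)
        rows.append((np.dtype(dt).name,
                     best_ms(reduce_sum, a),
                     best_ms(explicit_sum, a),
                     best_ms(reduce_sum_same_dtype, a)))
    return rows


if __name__ == "__main__":
    print("dtype    sum(axis=-1)   explicit   sum(dtype=input)")
    for name, t_sum, t_explicit, t_same in run():
        print(f"{name:<8} {t_sum:9.2f} ms {t_explicit:7.2f} ms {t_same:9.2f} ms")
```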
