[Numpy-discussion] Unnecessarily bad performance of elementwise operators with Fortran-arrays

Hans Meine meine@informatik.uni-hamburg...
Thu Nov 8 06:34:33 CST 2007


Hi!

I wonder why simple elementwise operations like "a * 2" or "a + 1" are not 
performed in order of increasing memory addresses, which would exploit CPU 
caches etc.  As it is now, they become around 3 times slower simply by 
transpose()ing the array.  Similarly (but even less logically), copy() and 
even the constructor are affected (yes, I understand that copy() creates 
contiguous arrays, but shouldn't it respect/retain the order nevertheless?):

### constructor ###
In [89]: %timeit -r 10 -n 1000000 numpy.ndarray((3,3,3))
1000000 loops, best of 10: 1.19 µs per loop

In [90]: %timeit -r 10 -n 1000000 numpy.ndarray((3,3,3), order="f")
1000000 loops, best of 10: 2.19 µs per loop

### copy 3x3x3 array ###
In [85]: a = numpy.ndarray((3,3,3))

In [86]: %timeit -r 10 a.copy()
1000000 loops, best of 10: 1.14 µs per loop

In [87]: a = numpy.ndarray((3,3,3), order="f")

In [88]: %timeit -r 10 -n 1000000 a.copy()
1000000 loops, best of 10: 3.39 µs per loop

### copy 256x256x256 array ###
In [74]: a = numpy.ndarray((256,256,256))

In [75]: %timeit -r 10 a.copy()
10 loops, best of 10: 119 ms per loop

In [76]: a = numpy.ndarray((256,256,256), order="f")

In [77]: %timeit -r 10 a.copy()
10 loops, best of 10: 274 ms per loop
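
On the question of whether copy() retains the order, the layout can be
inspected directly. The following is only a minimal sketch, assuming a
reasonably recent NumPy release in which ndarray.copy() accepts an order
argument; the array names are made up for illustration:

```python
import numpy as np

# A C-ordered (row-major) array: the last axis varies fastest in memory.
a = np.zeros((3, 4, 5))

# Transposing returns a view on the same buffer with reversed strides,
# so the result is Fortran-contiguous and no data is moved.
t = a.transpose()
print(a.strides, a.flags.c_contiguous)
print(t.strides, t.flags.f_contiguous)

# ndarray.copy() defaults to order="C", so this copy is C-contiguous again.
c_copy = t.copy()

# order="K" asks copy() to match the source layout as closely as possible.
k_copy = t.copy(order="K")
print(c_copy.flags.c_contiguous, k_copy.flags.f_contiguous)
```

With order="K" the copy of a Fortran-contiguous array stays
Fortran-contiguous, which is the "retain the order" behaviour asked about
above.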

### fill ###
In [79]: a = numpy.ndarray((256,256,256))

In [80]: %timeit -r 10 a.fill(0)
10 loops, best of 10: 60.2 ms per loop

In [81]: a = numpy.ndarray((256,256,256), order="f")

In [82]: %timeit -r 10 a.fill(0)
10 loops, best of 10: 60.2 ms per loop

### power ###
In [151]: a = numpy.ndarray((256,256,256))

In [152]: %timeit -r 10 a ** 2
10 loops, best of 10: 124 ms per loop

In [153]: a = numpy.asfortranarray(a)

In [154]: %timeit -r 10 a ** 2
10 loops, best of 10: 458 ms per loop

### addition ###
In [160]: a = numpy.ndarray((256,256,256))

In [161]: %timeit -r 10 a + 1
10 loops, best of 10: 139 ms per loop

In [162]: a = numpy.asfortranarray(a)

In [163]: %timeit -r 10 a + 1
10 loops, best of 10: 465 ms per loop
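
The elementwise comparison above can be repeated outside IPython with the
standard timeit module. This is a rough sketch, not the setup used for the
numbers above: the helper name best_time and the smaller 64x64x64 shape are
assumptions chosen so that it runs quickly, and the measured ratio will
depend on the NumPy version and the machine.

```python
import timeit

import numpy as np


def best_time(func, repeat=5, number=3):
    """Return the best per-call time in seconds over `repeat` runs."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


shape = (64, 64, 64)                 # assumption: smaller than 256**3 for speed
c_arr = np.ones(shape)               # C-contiguous
f_arr = np.asfortranarray(c_arr)     # same values, Fortran-contiguous

for label, arr in (("C order", c_arr), ("F order", f_arr)):
    t_add = best_time(lambda: arr + 1)
    t_pow = best_time(lambda: arr ** 2)
    print(f"{label}: a + 1 -> {t_add * 1e3:.3f} ms, a ** 2 -> {t_pow * 1e3:.3f} ms")

# The values are identical either way; only the traversal order can differ.
same_values = np.array_equal(c_arr + 1, f_arr + 1)
```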

### fft ###
In [146]: %timeit -r 10 numpy.fft.fft(vol, axis=0)
10 loops, best of 10: 1.16 s per loop

In [148]: %timeit -r 10 numpy.fft.fft(vol0, axis=2)
10 loops, best of 10: 1.16 s per loop

In [149]: vol.flags
Out[149]:
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

In [150]: vol0.flags
Out[150]:
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

In [9]: %timeit -r 10 numpy.fft.fft(vol0, axis=0)
10 loops, best of 10: 939 ms per loop
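
Note that vol and vol0 are not defined in this session. The flags shown
above (F_CONTIGUOUS True, OWNDATA False) would be consistent with vol0 being
a transposed view of vol, but that is an assumption. Under that assumption,
axis 0 of vol and axis 2 of vol0 address the same lines of data, which would
fit the identical timings. A small sketch with a made-up random volume:

```python
import numpy as np

# Assumption: a small random volume stands in for the undefined `vol`.
rng = np.random.default_rng(0)
vol = rng.random((8, 6, 4))

# Assumption: `vol0` is the transposed view (axes reversed, no copy).
vol0 = vol.transpose()
print(vol0.flags.f_contiguous, vol0.flags.owndata)

# Axis 0 of vol is axis 2 of vol0, so both calls transform the same lines.
spec = np.fft.fft(vol, axis=0)
spec0 = np.fft.fft(vol0, axis=2)
print(np.allclose(spec.transpose(), spec0))
```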

### mean ###
In [173]: %timeit -r 10 vol.mean()
10 loops, best of 10: 272 ms per loop

In [174]: %timeit -r 10 vol0.mean()
10 loops, best of 10: 683 ms per loop

### max ###
In [175]: %timeit -r 10 vol.max()
10 loops, best of 10: 63.8 ms per loop

In [176]: %timeit -r 10 vol0.max()
10 loops, best of 10: 475 ms per loop

### min ###
In [177]: %timeit -r 10 vol.min()
10 loops, best of 10: 63.8 ms per loop

In [178]: %timeit -r 10 vol0.min()
10 loops, best of 10: 476 ms per loop

### rot90 ###
In [10]: %timeit -r 10 numpy.rot90(vol)
100000 loops, best of 10: 6.97 µs per loop

In [12]: %timeit -r 10 numpy.rot90(vol0)
100000 loops, best of 10: 6.92 µs per loop
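
The near-identical rot90 timings are plausibly explained by rot90 only
rearranging strides and returning a view, so that no element is touched in
either layout. A minimal sketch, assuming a current NumPy; the small shape is
chosen only for illustration:

```python
import numpy as np

vol = np.zeros((4, 5, 6))           # stand-in for the larger volume
rot = np.rot90(vol)                 # rotates in the plane of the first two axes

print(rot.shape)                    # first two axes are swapped
print(rot.flags.owndata)            # a view does not own its data
print(np.shares_memory(rot, vol))   # both refer to the same buffer
```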

-- 
Ciao, /  /
     /--/
    /  / ANS

