[Numpy-discussion] Unnecessarily bad performance of elementwise operators with Fortran-arrays

David Cournapeau david@ar.media.kyoto-u.ac...
Thu Nov 8 06:44:59 CST 2007


Hans Meine wrote:
> Hi!
>
> I wonder why simple elementwise operations like "a * 2" or "a + 1" are not 
> performed in order of increasing memory addresses in order to exploit CPU 
> caches etc. - as it is now, their speed drops by a factor of around 3 simply 
> by transpose()ing. 
Because it is not trivial to do so in all cases, I guess. It is a 
problem which comes up from time to time on the ML, but AFAIK nobody has 
had a fix for it. Fundamentally, for many element-wise operations, either 
you have to implement the operation for every possible case, or you 
abstract it through an iterator, which costs you performance in some 
cases. There are also cases where the current implementation is far from 
optimal, for lack of manpower I guess (a look at PyArray_Mean, for 
example, shows that it uses PyArray_GenericReduceFunction, which is 
really slow compared to a straight C implementation).
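
For the special case of a fully Fortran-contiguous array, one way to get an elementwise loop that runs over increasing memory addresses is to operate on the transposed view, which is C-contiguous, and transpose the result back. This is only a sketch of that idea: `scale_in_memory_order` is a hypothetical helper, not a NumPy function, and it does not address strided or non-contiguous arrays, which is where the general problem gets hard.

```python
import numpy as np

def scale_in_memory_order(a, factor):
    """Multiply `a` by `factor` (hypothetical helper, not part of NumPy)."""
    if a.flags.f_contiguous and not a.flags.c_contiguous:
        # a.T is a C-contiguous view of the same buffer, so the elementwise
        # loop walks increasing addresses. Transposing the result back
        # restores the original shape and Fortran layout without a copy.
        return (a.T * factor).T
    # C-contiguous (or arbitrary) input: use the ordinary operator.
    return a * factor

a = np.asfortranarray(np.arange(24, dtype=float).reshape(2, 3, 4))
b = scale_in_memory_order(a, 2)
```

The result has the same values as `a * 2` and keeps the Fortran layout of the input.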
>  Similarly (but even less logical), copy() and even the 
> constructor are affected (yes, I understand that copy() creates contiguous 
> arrays, but shouldn't it respect/retain the order nevertheless?):
>   
I don't see why it is illogical: copy() does not preserve the memory 
layout (the result is C-ordered), so for a Fortran-ordered source a 
simple memcpy of the whole buffer is not possible.
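
As a small illustration of the layout question, the `order` argument of `ndarray.copy` controls whether the copy changes or keeps the layout. The sketch below assumes a reasonably recent NumPy release in which `order="K"` is accepted; it only inspects the resulting flags and says nothing about speed.

```python
import numpy as np

a = np.asfortranarray(np.zeros((3, 3, 3)))

c_copy = a.copy()            # default order="C": the copy is C-ordered
f_copy = a.copy(order="F")   # explicitly request a Fortran-ordered copy
k_copy = a.copy(order="K")   # keep the source layout as closely as possible

for name, arr in (("default", c_copy), ("F", f_copy), ("K", k_copy)):
    print(name, arr.flags.c_contiguous, arr.flags.f_contiguous)
```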

cheers,

David
> ### constructor ###
> In [89]: %timeit -r 10 -n 1000000 numpy.ndarray((3,3,3))
> 1000000 loops, best of 10: 1.19 µs per loop
>
> In [90]: %timeit -r 10 -n 1000000 numpy.ndarray((3,3,3), order="f")
> 1000000 loops, best of 10: 2.19 µs per loop
>
> ### copy 3x3x3 array ###
> In [85]: a = numpy.ndarray((3,3,3))
>
> In [86]: %timeit -r 10 a.copy()
> 1000000 loops, best of 10: 1.14 µs per loop
>
> In [87]: a = numpy.ndarray((3,3,3), order="f")
>
> In [88]: %timeit -r 10 -n 1000000 a.copy()
> 1000000 loops, best of 10: 3.39 µs per loop
>
> ### copy 256x256x256 array ###
> In [74]: a = numpy.ndarray((256,256,256))
>
> In [75]: %timeit -r 10 a.copy()
> 10 loops, best of 10: 119 ms per loop
>
> In [76]: a = numpy.ndarray((256,256,256), order="f")
>
> In [77]: %timeit -r 10 a.copy()
> 10 loops, best of 10: 274 ms per loop
>
> ### fill ###
> In [79]: a = numpy.ndarray((256,256,256))
>
> In [80]: %timeit -r 10 a.fill(0)
> 10 loops, best of 10: 60.2 ms per loop
>
> In [81]: a = numpy.ndarray((256,256,256), order="f")
>
> In [82]: %timeit -r 10 a.fill(0)
> 10 loops, best of 10: 60.2 ms per loop
>
> ### power ###
> In [151]: a = numpy.ndarray((256,256,256))
>
> In [152]: %timeit -r 10 a ** 2
> 10 loops, best of 10: 124 ms per loop
>
> In [153]: a = numpy.asfortranarray(a)
>
> In [154]: %timeit -r 10 a ** 2
> 10 loops, best of 10: 458 ms per loop
>
> ### addition ###
> In [160]: a = numpy.ndarray((256,256,256))
>
> In [161]: %timeit -r 10 a + 1
> 10 loops, best of 10: 139 ms per loop
>
> In [162]: a = numpy.asfortranarray(a)
>
> In [163]: %timeit -r 10 a + 1
> 10 loops, best of 10: 465 ms per loop
>
> ### fft ###
> In [146]: %timeit -r 10 numpy.fft.fft(vol, axis=0)
> 10 loops, best of 10: 1.16 s per loop
>
> In [148]: %timeit -r 10 numpy.fft.fft(vol0, axis=2)
> 10 loops, best of 10: 1.16 s per loop
>
> In [149]: vol.flags
> Out[149]:
>   C_CONTIGUOUS : True
>   F_CONTIGUOUS : False
>   OWNDATA : True
>   WRITEABLE : True
>   ALIGNED : True
>   UPDATEIFCOPY : False
>
> In [150]: vol0.flags
> Out[150]:
>   C_CONTIGUOUS : False
>   F_CONTIGUOUS : True
>   OWNDATA : False
>   WRITEABLE : True
>   ALIGNED : True
>   UPDATEIFCOPY : False
>
> In [9]: %timeit -r 10 numpy.fft.fft(vol0, axis=0)
> 10 loops, best of 10: 939 ms per loop
>
> ### mean ###
> In [173]: %timeit -r 10 vol.mean()
> 10 loops, best of 10: 272 ms per loop
>
> In [174]: %timeit -r 10 vol0.mean()
> 10 loops, best of 10: 683 ms per loop
>
> ### max ###
> In [175]: %timeit -r 10 vol.max()
> 10 loops, best of 10: 63.8 ms per loop
>
> In [176]: %timeit -r 10 vol0.max()
> 10 loops, best of 10: 475 ms per loop
>
> ### min ###
> In [177]: %timeit -r 10 vol.min()
> 10 loops, best of 10: 63.8 ms per loop
>
> In [178]: %timeit -r 10 vol0.min()
> 10 loops, best of 10: 476 ms per loop
>
> ### rot90 ###
> In [10]: %timeit -r 10 numpy.rot90(vol)
> 100000 loops, best of 10: 6.97 µs per loop
>
> In [12]: %timeit -r 10 numpy.rot90(vol0)
> 100000 loops, best of 10: 6.92 µs per loop
>
>   
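
The quoted timings come from an interactive IPython session. To repeat the C-versus-Fortran comparison outside IPython, something like the following sketch with the standard `timeit` module should work. The helper name `time_layouts` and the default shape and repeat counts are assumptions chosen for illustration, and the numbers it prints will depend on the NumPy version and the machine.

```python
import timeit
import numpy as np

def time_layouts(shape=(64, 64, 64), repeat=3, number=5):
    """Return the best per-call time in seconds for `a + 1`, per layout.

    Hypothetical helper for illustration; not part of NumPy.
    """
    results = {}
    for order in ("C", "F"):
        a = np.zeros(shape, order=order)
        # timeit.repeat returns total seconds for `number` calls, per repeat.
        totals = timeit.repeat(lambda: a + 1, repeat=repeat, number=number)
        results[order] = min(totals) / number
    return results

timings = time_layouts()
for order, seconds in timings.items():
    print(f"order={order}: {seconds * 1e3:.3f} ms per loop")
```

Passing `shape=(256, 256, 256)` matches the array size used in the quoted session, at the cost of a longer run.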


