[Numpy-discussion] object array alignment issues

Francesc Alted faltet@pytables....
Sat Oct 17 06:20:38 CDT 2009


On Friday 16 October 2009 18:05:05, Sturla Molden wrote:
> Francesc Alted wrote:
> > The response is clear: avoid memcpy() if you can.  It is true that
> > memcpy() performance has improved quite a lot in recent gcc releases
> > (it has been quite good on Windows for many years), but working with
> > data in-place (i.e. avoiding a memory copy) is always faster,
> > especially for large arrays that don't fit in the processor's cache.
> >
> > My own experiments say that, on an Intel Core2 processor, the typical
> > speed-ups for avoiding memcpy() are about 2x.
>
> If the underlying array is strided, I have seen the opposite as well.
> "Copy-in copy-out" is a common optimization used by Fortran compilers
> when working with strided arrays. The catch is that the work array has
> to fit in cache for this to make any sense. Anyhow, you cannot use
> memcpy() for this kind of optimization - it assumes both buffers are
> contiguous. But working with arrays directly instead of copies is not
> always the faster option.
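
In numpy terms, the copy-in copy-out pattern Sturla describes looks
roughly like this (a quick sketch; the helper names are mine, for
illustration only):

    import numpy as np

    def scale_direct(a, factor):
        # Operate on the strided view in place.
        a *= factor

    def scale_copy_in_copy_out(a, factor):
        # Gather the strided view into a contiguous work array,
        # operate there, then scatter the result back.  This only
        # pays off when the work array fits in cache.
        work = np.ascontiguousarray(a)
        work *= factor
        a[...] = work

    x = np.arange(2 * 10**6, dtype='f8')
    view = x[::2]    # strided view: 16-byte stride
    scale_copy_in_copy_out(view, 2.0)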

Mmh, I don't know about Fortran (too many years without programming in it),
but in C it seems clear that doing a memcpy() first is always slower than
operating on the data in place, at least on modern CPUs (like the Intel
Core2 I'm using now):

In [43]: import numpy as np

In [44]: import numexpr as ne

In [45]: r = np.zeros(1e6, 'i1,i4,f8')

In [46]: f1, f2 = r['f1'], r['f2']

In [47]: f1.flags.aligned, f2.flags.aligned
Out[47]: (False, False)                    

In [48]: timeit f1*f2      # NumPy makes copies before operating
100 loops, best of 3: 14.6 ms per loop

In [49]: timeit ne.evaluate('f1*f2')   # numexpr uses plain unaligned access
100 loops, best of 3: 5.77 ms per loop   # 2.5x faster than numpy
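
As an aside, the fields come out unaligned because 'i1,i4,f8' is a packed
record with no padding: the field offsets are 0, 1 and 5 bytes, so 'f1'
and 'f2' never start on a 4- or 8-byte boundary.  Easy to check:

    >>> r.dtype.itemsize       # packed record: 1 + 4 + 8 bytes
    13
    >>> [(name, r.dtype.fields[name][1]) for name in r.dtype.names]
    [('f0', 0), ('f1', 1), ('f2', 5)]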

Using strides, the result is similar:

In [50]: f1, f2 = r['f1'][::2], r['f2'][::2]  # check with strides

In [51]: f1.flags.aligned, f2.flags.aligned
Out[51]: (False, False)

In [52]: timeit f1*f2
100 loops, best of 3: 7.52 ms per loop

In [53]: timeit ne.evaluate('f1*f2')
100 loops, best of 3: 3.96 ms per loop   # 1.9x faster than numpy

And when using a larger stride, so that the resulting arrays are small
enough to fit in cache:

In [54]: f1, f2 = r['f1'][::10], r['f2'][::10]  # big stride (fits in cache)

In [55]: timeit f1*f2
100 loops, best of 3: 3.51 ms per loop

In [56]: timeit ne.evaluate('f1*f2')
100 loops, best of 3: 2.61 ms per loop  # 34% faster than numpy

This is not a huge margin, but it still favors the direct approach.
So, at least in C, operating directly on unaligned data seems to be the
fastest option on modern AMD/Intel processors (at least in this
quick-and-dirty benchmark).  In fact, performance gets very close to that
of contiguous, aligned data:

In [58]: f1, f2 = r['f1'].copy(), r['f2'].copy()   # aligned and contiguous

In [59]: timeit f1*f2
100 loops, best of 3: 5.2 ms per loop

In [60]: timeit ne.evaluate('f1*f2')
100 loops, best of 3: 4.74 ms per loop

so 5.77 ms (unaligned data, In [49]) is not very far from 4.74 ms (aligned
data, In [60]), and close to 'optimal' NumPy performance (5.2 ms, In [59]).
And, as I said before, AMD and Intel plan to reduce this gap still further.
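
By the way, if you need aligned fields without copying each one out, numpy
can pad the record like a C struct via align=True.  A quick sketch (the
trade-off is that the itemsize grows from 13 to 16 bytes):

    >>> dt = np.dtype('i1,i4,f8', align=True)  # pad fields like a C struct
    >>> [(name, dt.fields[name][1]) for name in dt.names]
    [('f0', 0), ('f1', 4), ('f2', 8)]
    >>> ra = np.zeros(1000000, dtype=dt)
    >>> ra['f1'].flags.aligned, ra['f2'].flags.aligned
    (True, True)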

For unaligned arrays that fit in cache, the results are even more dramatic:

In [61]: r = np.zeros(1e5, 'i1,i4,f8')

In [62]: f1, f2 = r['f1'], r['f2']

In [63]: timeit f1*f2
1000 loops, best of 3: 1.37 ms per loop

In [64]: timeit ne.evaluate('f1*f2')
1000 loops, best of 3: 293 µs per loop  #  4.7x speedup

though I'm not sure why...
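
If anyone wants to dig into this, a small harness along these lines (my
own sketch, not part of the session above; it assumes numexpr is
installed) makes it easy to sweep the array size and see where the cache
effect kicks in:

    import timeit
    import numpy as np
    import numexpr as ne

    # Compare numpy vs numexpr on the unaligned fields across sizes.
    for n in (10**4, 10**5, 10**6):
        r = np.zeros(n, dtype='i1,i4,f8')
        f1, f2 = r['f1'], r['f2']
        t_np = min(timeit.repeat(lambda: f1 * f2,
                                 number=100, repeat=3)) / 100
        t_ne = min(timeit.repeat(
            lambda: ne.evaluate('f1*f2',
                                local_dict={'f1': f1, 'f2': f2}),
            number=100, repeat=3)) / 100
        print("n=%-8d numpy: %.3f ms  numexpr: %.3f ms  (%.1fx)"
              % (n, t_np * 1e3, t_ne * 1e3, t_np / t_ne))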

Cheers,

-- 
Francesc Alted

