[Numpy-discussion] object array alignment issues
Francesc Alted
faltet@pytables....
Sat Oct 17 06:20:38 CDT 2009
On Friday 16 October 2009 18:05:05, Sturla Molden wrote:
> Francesc Alted wrote:
> > The response is clear: avoid memcpy() if you can. It is true that
> > memcpy() performance has improved quite a lot in recent gcc versions
> > (and has been quite good in Windows builds for many years), but working
> > with data in-place (i.e. avoiding a memory copy) is always faster (most
> > notably for large arrays that don't fit in the processor cache).
> >
> > My own experiments say that, on an Intel Core2 processor, the typical
> > speed-up for avoiding memcpy() is 2x.
>
> If the underlying array is strided, I have seen the opposite as well.
> "Copy-in copy-out" is a common optimization used by Fortran compilers
> when working with strided arrays. The catch is that the work array has
> to fit in cache for this to make any sense. Anyhow, you cannot use
> memcpy for this kind of optimization - it assumes both buffers are
> contiguous. But working with arrays directly instead of copies is not
> always the faster option.
Mmh, I don't know about Fortran (too many years without programming in it),
but in C it seems clear that performing a memcpy() is always slower, at least
on modern CPUs (like the Intel Core2 that I'm using now):
In [43]: import numpy as np
In [44]: import numexpr as ne
In [45]: r = np.zeros(1000000, 'i1,i4,f8')
In [46]: f1, f2 = r['f1'], r['f2']
In [47]: f1.flags.aligned, f2.flags.aligned
Out[47]: (False, False)
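(As an aside, for anyone wondering where the misalignment comes from: a
comma-separated dtype string like 'i1,i4,f8' builds a *packed* record type
with no padding between fields, so the i4 field sits at byte offset 1 and the
f8 field at byte offset 5 -- neither is a multiple of its own itemsize, hence
the unaligned views above. A quick check:

```python
import numpy as np

# 'i1,i4,f8' lays the fields out back-to-back with no padding,
# so the i4 and f8 fields land on odd byte offsets.
dt = np.dtype('i1,i4,f8')
print(dt.itemsize)                                # 13 bytes per record
print([dt.fields[name][1] for name in dt.names])  # offsets [0, 1, 5]
```

)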
In [48]: timeit f1*f2 # NumPy makes copies before carrying out operations
100 loops, best of 3: 14.6 ms per loop
In [49]: timeit ne.evaluate('f1*f2') # numexpr uses plain unaligned access
100 loops, best of 3: 5.77 ms per loop # 2.5x faster than numpy
Using strides, the result is similar:
In [50]: f1, f2 = r['f1'][::2], r['f2'][::2] # check with strides
In [51]: f1.flags.aligned, f2.flags.aligned
Out[51]: (False, False)
In [52]: timeit f1*f2
100 loops, best of 3: 7.52 ms per loop
In [53]: timeit ne.evaluate('f1*f2')
100 loops, best of 3: 3.96 ms per loop # 1.9x faster than numpy
And, when using large strides so that the resulting arrays fit in cache:
In [54]: f1, f2 = r['f1'][::10], r['f2'][::10] # big stride (fits in cache)
In [55]: timeit f1*f2
100 loops, best of 3: 3.51 ms per loop
In [56]: timeit ne.evaluate('f1*f2')
100 loops, best of 3: 2.61 ms per loop # 34% faster than numpy
That is not much of a margin, but it still gives the advantage to the direct
approach.
So, at least in C, operating on unaligned data in-place seems to be the
fastest option on (modern) AMD/Intel processors (at least in this
quick-and-dirty benchmark). In fact, performance is very close to that for
contiguous, aligned data:
In [58]: f1, f2 = r['f1'].copy(), r['f2'].copy() # aligned and contiguous
In [59]: timeit f1*f2
100 loops, best of 3: 5.2 ms per loop
In [60]: timeit ne.evaluate('f1*f2')
100 loops, best of 3: 4.74 ms per loop
so 5.77 ms (unaligned data, In [49]) is not very far from 4.74 ms (aligned
data, In [60]) and close to 'optimal' numpy performance (5.2 ms, In [59]).
And, as I said before, AMD and Intel plan to reduce this gap still further.
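For completeness, a small sketch (my addition, not part of the benchmark
above) of how to sidestep the alignment issue entirely: passing align=True
when building the record dtype pads the fields to their natural boundaries,
like a C compiler would, so the field views come out aligned at the cost of a
larger itemsize.

```python
import numpy as np

# align=True pads the record: i4 moves to offset 4 and f8 to offset 8,
# growing each record from 13 to 16 bytes, but the views are aligned.
dt = np.dtype('i1,i4,f8', align=True)
r = np.zeros(1000, dt)
print(dt.itemsize)            # 16
print(r['f1'].flags.aligned)  # True
print(r['f2'].flags.aligned)  # True
```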
For unaligned arrays that fit in cache, the results are even more dramatic:
In [61]: r = np.zeros(100000, 'i1,i4,f8')
In [62]: f1, f2 = r['f1'], r['f2']
In [63]: timeit f1*f2
1000 loops, best of 3: 1.37 ms per loop
In [64]: timeit ne.evaluate('f1*f2')
1000 loops, best of 3: 293 µs per loop # 4.7x speedup
though I'm not sure why...
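In case anyone wants to reproduce the NumPy side of this without numexpr,
here is a minimal self-contained version of the benchmark using the stdlib
timeit module (timings will of course vary per machine):

```python
import timeit
import numpy as np

r = np.zeros(1000000, 'i1,i4,f8')
f1, f2 = r['f1'], r['f2']      # unaligned, strided views into the records
a1, a2 = f1.copy(), f2.copy()  # aligned, contiguous copies

# The results are identical either way; only the speed differs.
assert np.array_equal(f1 * f2, a1 * a2)

t_unaligned = timeit.timeit(lambda: f1 * f2, number=20)
t_aligned = timeit.timeit(lambda: a1 * a2, number=20)
print('unaligned: %.1f ms per loop' % (1000 * t_unaligned / 20))
print('aligned:   %.1f ms per loop' % (1000 * t_aligned / 20))
```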
Cheers,
--
Francesc Alted