[Numpy-discussion] Byte aligned arrays
Fri Dec 21 04:58:45 CST 2012
On Fri, 2012-12-21 at 11:34 +0100, Francesc Alted wrote:
> > Also this convolution code:
> > https://github.com/hgomersall/SSE-convolution/blob/master/convolve.c
> > Shows a small but repeatable speed-up (a few %) when using some
> > aligned loads (as many as I can work out to use!).
> Okay, so 15% is significant, yes. I'm still wondering why I did
> not get any speedup at all using MKL, but probably the reason is that it
> manages the unaligned corners of the datasets first, and then uses an
> aligned access for the rest of the data (but just guessing here).
With SSE in the convolution code linked above (in which every alignment
needs to be considered for each output element), I see a significant
speedup from creating 4 copies of the float input array using memcpy,
each shifted by 1 float (so the 5th element is aligned again). Despite
all the extra copies, it's still quicker than using an unaligned load.
However, when one tries the same trick with 8 copies for AVX it's
actually slower than the SSE case.
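For reference, the 4-copy SSE trick described above can be sketched roughly as follows. This is a minimal illustration, not the code from the linked convolve.c: the helper names (make_shifted_copies, free_shifted_copies, convolve_valid_sse) are hypothetical, the buffers are assumed to come from C11 aligned_alloc, and the kernel is applied as a plain sliding dot product (no flipping). Copy k starts at in[k], so the 4 floats beginning at input position p always sit at a 16-byte-aligned address in copy p % 4.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <xmmintrin.h> /* SSE: _mm_load_ps, _mm_set1_ps, _mm_mul_ps, ... */

#define N_SHIFTS 4 /* 4 floats fit in one 16-byte SSE register */

/* Hypothetical helper: release the buffers made by make_shifted_copies. */
static void free_shifted_copies(float *copies[N_SHIFTS])
{
    for (int k = 0; k < N_SHIFTS; k++) {
        free(copies[k]);
        copies[k] = NULL;
    }
}

/* Hypothetical helper: copies[k] holds in[k..n), zero padded, in its own
 * 16-byte-aligned buffer. Returns 0 on success, -1 on allocation failure. */
static int make_shifted_copies(const float *in, size_t n,
                               float *copies[N_SHIFTS])
{
    /* Round up to a multiple of 4 floats, always leaving some padding. */
    size_t padded = (n / 4 + 1) * 4;

    for (int k = 0; k < N_SHIFTS; k++)
        copies[k] = NULL;

    for (int k = 0; k < N_SHIFTS; k++) {
        copies[k] = aligned_alloc(16, padded * sizeof(float));
        if (copies[k] == NULL) {
            free_shifted_copies(copies);
            return -1;
        }
        memset(copies[k], 0, padded * sizeof(float));
        if ((size_t)k < n)
            memcpy(copies[k], in + k, (n - (size_t)k) * sizeof(float));
    }
    return 0;
}

/* out[i] = sum_j kernel[j] * in[i + j], for i in [0, n - klen].
 * Every load below is an aligned _mm_load_ps. */
static void convolve_valid_sse(float *copies[N_SHIFTS], size_t n,
                               const float *kernel, size_t klen, float *out)
{
    assert(klen >= 1 && klen <= n);
    size_t n_out = n - klen + 1;

    for (size_t i = 0; i < n_out; i += 4) {
        __m128 acc = _mm_setzero_ps();
        for (size_t j = 0; j < klen; j++) {
            size_t k = (i + j) % 4;                 /* which shifted copy */
            const float *p = copies[k] + (i + j - k); /* multiple of 4 */
            acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(p),
                                             _mm_set1_ps(kernel[j])));
        }
        /* Only the first n_out - i lanes are valid in the last block. */
        float tmp[4];
        _mm_storeu_ps(tmp, acc);
        size_t m = (n_out - i < 4) ? (n_out - i) : 4;
        memcpy(out + i, tmp, m * sizeof(float));
    }
}
```

The cost of the trick is the up-front memcpy of the whole input 4 times; the inner loop then never needs _mm_loadu_ps.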
The fastest implementation I have so far (AVX or otherwise) uses
16-byte-aligned arrays (made with 4 copies, as with SSE), with alternating
aligned and unaligned loads (which are always at worst 16-byte aligned).
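One way to read the alternating pattern is through the address arithmetic alone. The sketch below uses a hypothetical window_location struct and locate_window function and no AVX intrinsics; it only assumes 4 shifted copies whose base addresses are at least 16-byte aligned. For an 8-float window starting at input position pos, it picks the copy and index so that the address is always 16-byte aligned, and reports whether it also happens to be 32-byte aligned (in which case an aligned _mm256_load_ps could be used, otherwise _mm256_loadu_ps).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of where the window starting at input
 * position pos lives among 4 shifted copies (copy k starts at in[k]). */
typedef struct {
    size_t copy;   /* which shifted copy to read: pos % 4 */
    size_t index;  /* float index inside that copy, a multiple of 4 */
    int aligned32; /* 1 if the address is also 32-byte aligned */
} window_location;

static window_location locate_window(float *const copies[4], size_t pos)
{
    window_location w;
    w.copy = pos % 4;
    w.index = pos - w.copy;

    /* Assumed precondition: every copy starts on a 16-byte boundary. */
    assert(((uintptr_t)copies[w.copy] % 16) == 0);

    /* index is a multiple of 4 floats (16 bytes), so the address is
     * 16-byte aligned; every other such address is 32-byte aligned. */
    w.aligned32 = ((uintptr_t)(copies[w.copy] + w.index) % 32) == 0;
    return w;
}
```

As the output position advances in steps of 4 floats, aligned32 flips between 0 and 1, which matches the alternation of aligned and unaligned loads, with the unaligned ones never worse than 16-byte aligned.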