[Numpy-discussion] Byte aligned arrays
Thu Dec 20 12:35:20 CST 2012
On Thu, 2012-12-20 at 15:23 +0100, Francesc Alted wrote:
> On 12/20/12 9:53 AM, Henry Gomersall wrote:
> > On Wed, 2012-12-19 at 19:03 +0100, Francesc Alted wrote:
> >> The only scenario that I see that this would create unaligned
> >> arrays is for machines having AVX. But provided that the Intel
> >> architecture is making great strides in fetching unaligned data,
> >> I'd be surprised that the difference in performance would be even
> >> noticeable.
> >> Can you tell us which difference in performance you are seeing
> >> between an AVX-aligned array and one that is not AVX-aligned?
> >> Just curious.
> > Further to this point, from an Intel article...
> > "Aligning data to vector length is always recommended. When using
> > Intel SSE and Intel SSE2 instructions, loaded data should be aligned
> > to 16 bytes. Similarly, to achieve best results use Intel AVX
> > instructions on 32-byte vectors that are 32-byte aligned. The use of
> > Intel AVX instructions on unaligned 32-byte vectors means that every
> > second load will be across a cache-line split, since the cache line
> > is 64 bytes. This doubles the cache line split rate compared to
> > Intel SSE code that uses 16-byte vectors. A high cache-line split
> > rate in memory-intensive code is extremely likely to cause
> > performance degradation. For that reason, it is highly recommended
> > to align the data to 32 bytes for use with Intel AVX."
> > Though it would be nice to put together a little example of this!
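[Editor's note: a minimal sketch of one common way to get a 32-byte-aligned NumPy array, since plain `np.empty` makes no alignment promise beyond the allocator's. The helper name `empty_aligned` is hypothetical (pyFFTW later shipped a function of the same name, but this is an independent sketch): over-allocate a byte buffer, then slice in at the first aligned offset.]

```python
import numpy as np

def empty_aligned(shape, dtype=np.float64, alignment=32):
    """Allocate an array whose data buffer starts on an
    `alignment`-byte boundary, by over-allocating a byte buffer
    and slicing into it at the first aligned offset."""
    dtype = np.dtype(dtype)
    nbytes = int(np.prod(shape)) * dtype.itemsize
    buf = np.empty(nbytes + alignment, dtype=np.uint8)
    # Distance from the buffer start to the next aligned address.
    offset = (-buf.ctypes.data) % alignment
    return buf[offset:offset + nbytes].view(dtype).reshape(shape)

a = empty_aligned((1024, 1024), np.float64, alignment=32)
print(a.ctypes.data % 32)  # 0 -- data pointer is 32-byte aligned
```

Note the returned array keeps the padded buffer alive through its `.base` attribute, so no extra bookkeeping is needed.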
> Indeed, an example is what I was looking for. So provided that I have
> access to an AVX capable machine (having 6 physical cores), and that
> MKL 10.3 has support for AVX, I have made some comparisons using the
> Anaconda Python distribution (it ships with most packages linked
> against MKL 10.3).
> All in all, it is not clear that AVX alignment would have an advantage,
> even for memory-bounded problems. But of course, if Intel people are
> saying that AVX alignment is important, it is because they have use
> cases for asserting this. It is just that I'm having a difficult time
> finding these cases.
Thanks for those examples, they were very interesting. I managed to
temporarily get my hands on a machine with AVX and I have shown some
speed-up with aligned arrays.
FFT (using my wrappers) gives about a 15% speedup.
Also this convolution code:
Shows a small but repeatable speed-up (a few %) when using some aligned
loads (as many as I can work out how to use!).
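[Editor's note: a sketch of the kind of micro-benchmark being discussed, not the author's actual code. It times the same element-wise operation on a 32-byte-aligned view and on a view deliberately offset by 8 bytes (16- but not 32-byte aligned); whether any difference shows up depends on the CPU and on what SIMD code NumPy's build uses.]

```python
import numpy as np
from timeit import timeit

N = 2**24  # bytes of float64 data to operate on

# Over-allocate, then carve out an aligned and a misaligned view
# of the same size from the same buffer.
buf = np.empty(N + 64, dtype=np.uint8)
base = (-buf.ctypes.data) % 32
aligned = buf[base:base + N].view(np.float64)
misaligned = buf[base + 8:base + 8 + N].view(np.float64)

for name, arr in (("aligned", aligned), ("misaligned", misaligned)):
    t = timeit(lambda: arr * arr, number=100)
    print(name, "addr mod 32 =", arr.ctypes.data % 32, "time =", t)
```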