[Numpy-discussion] numpy arrays, data allocation and SIMD alignement
Charles R Harris
Sat Aug 4 00:30:46 CDT 2007
On 8/3/07, David Cournapeau <firstname.lastname@example.org> wrote:
> Andrew Straw wrote:
> > Dear David,
> > Both ideas, particularly the 2nd, would be excellent additions to numpy.
> > I often use the Intel IPP (Integrated Performance Primitives) Library
> > together with numpy, but I have to do all my memory allocation with the
> > IPP to ensure fastest operation. I then create numpy views of the data.
> > All this works brilliantly, but it would be really nice if I could
> > allocate the memory directly in numpy.
> > IPP allocates, and says it wants, 32-byte aligned memory (see, e.g.
> > http://www.intel.com/support/performancetools/sb/CS-021418.htm ). Given
> > that fftw3 apparently wants 16-byte aligned memory, my feeling is that,
> > if the effort is made, the alignment width should be specified at
> > run-time, rather than hard-coded.
> I think that doing it at runtime would be overkill, no? I was thinking
> about making it a compile option. Generally, at the ASM level, you need
> 16-byte alignment (for instructions like movaps, which loads 16 bytes
> from memory into an SSE register); this is not just fftw. Maybe
> the 32-byte alignment is useful for cache reasons, I don't know.
> I don't think it would be difficult to implement and validate; what I
> don't know at all is the implication of this at the binary level, if any.
Here are some workarounds that Google turned up:
(1) Use static variables instead of dynamic (stack) variables
(2) Use in-line assembly code that explicitly aligns data
(3) In C code, use "malloc" to explicitly allocate variables
Here is Intel's example of (2):
; procedure prologue
mov esp, ebp
and ebp, -8
sub esp, 12
; procedure epilogue
add esp, 12
Intel's example of (3), slightly modified:
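double *p, *newp;
p = (double*)malloc((sizeof(double)*NPTS)+4);
/* Round up in bytes, not in units of sizeof(double); needs <stdint.h>.
   Assumes malloc returns memory that is at least 4-byte aligned. */
newp = (double*)(((uintptr_t)p + 4) & ~(uintptr_t)7);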
This ensures that newp is 8-byte aligned even if p is not. However,
malloc() may already follow Intel's recommendation that data structures
of 32 bytes or more be aligned on a 32-byte boundary. In that case,
increasing the requested memory by 4 bytes and computing newp are
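
For what it's worth, approach (3) can be generalized so that the alignment is chosen at run time, as Andrew suggested. What follows is only a sketch under stated assumptions: the names aligned_malloc and aligned_free are hypothetical (they are not part of numpy, IPP or fftw), and the alignment is assumed to be a power of two. The pointer returned by malloc is stored just in front of the aligned block so that it can be freed later.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: return `size` bytes whose address is a multiple of
   `alignment` (a power of two), or NULL on failure. */
void *aligned_malloc(size_t size, size_t alignment)
{
    void *raw;
    uintptr_t addr;

    /* reject zero and anything that is not a power of two */
    if (alignment == 0 || (alignment & (alignment - 1)) != 0)
        return NULL;
    /* keep the slot that stores the raw pointer itself properly aligned */
    if (alignment < sizeof(void *))
        alignment = sizeof(void *);

    /* room for the stored pointer plus worst-case padding */
    raw = malloc(size + sizeof(void *) + alignment - 1);
    if (raw == NULL)
        return NULL;

    /* round up in byte units to the next multiple of `alignment` */
    addr = ((uintptr_t)raw + sizeof(void *) + alignment - 1)
           & ~(uintptr_t)(alignment - 1);

    /* remember what malloc returned, just below the aligned block */
    ((void **)addr)[-1] = raw;
    return (void *)addr;
}

/* Hypothetical counterpart: release a block obtained from aligned_malloc. */
void aligned_free(void *ptr)
{
    if (ptr != NULL)
        free(((void **)ptr)[-1]);
}
```

A caller would request, say, aligned_malloc(NPTS * sizeof(double), 16) for SSE or pass 32 for IPP, and must release the block with aligned_free rather than free. The cost is one pointer plus up to alignment - 1 bytes of padding per allocation.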