[Numpy-discussion] Objected-oriented SIMD API for Numpy

Sturla Molden sturla@molden...
Wed Oct 21 21:31:23 CDT 2009

Mathieu Blondel skrev:
> Hello,
> About one year ago, a high-level, objected-oriented SIMD API was added
> to Mono. For example, there is a class Vector4f for vectors of 4
> floats and this class implements methods such as basic operators,
> bitwise operators, comparison operators, min, max, sqrt, shuffle
> directly using SIMD operations.
I think you are confusing SIMD with Intel's MMX/SSE instruction set.

SIMD means "single instruction - multiple data". NumPy is interherently 
an object-oriented SIMD API:

  array1[:] = array2 + array3

is a SIMD instruction by definition.

SIMD instructions in hardware for length-4 vectors are mostly useful for 
3D graphics. But they are not used a lot for that purpose, because GPUs 
are getting common. SSE is mostly for rendering 3D graphics without a 
GPU. There is nothing that prevents NumPy from having a Vector4f dtype, 
that internally stores four float32 and is aligned at 16 byte 
boundaries. But it would not be faster than the current float32 dtype. 
Do you know why?

The reason is that memory access is slow, and computation is fast. 
Modern CPUs are starved. The speed of NumPy is not limited by not using 
MMX/SSE whenever possible. It is limited from having to create and 
delete temporary arrays all the time. You are suggesting to optimize in 
the wrong place. There is a lot that can be done to speed up 
computation: There are optimized BLAS libraries like ATLAS and MKL. 
NumPy uses BLAS for things like matrix multiplication. There are OpenMP 
for better performance on multicores. There are OpenCL and CUDA for 
moving computation from CPUs to GPU. But the main boost you get from 
going from NumPy to hand-written C or Fortran comes from reduced memory use.

> existing discussion here. Memory-alignment is an import related issue
> since non-aligned movs can tank the performance.

You can align an ndarray on 16-byte boundary like this:

def aligned_array(N, dtype):
     d = dtype()
     tmp = numpy.zeros(N * d.nbytes + 16, dtype=numpy.uint8)
     address = tmp.__array_interface__['data'][0]
     offset = (16 - address % 16) % 16
     return tmp[offset:offset+N].view(dtype=dtype)

Sturla Molden

More information about the NumPy-Discussion mailing list