[Numpy-discussion] Objected-oriented SIMD API for Numpy
Sturla Molden
sturla@molden...
Wed Oct 21 21:31:23 CDT 2009
Mathieu Blondel skrev:
> Hello,
>
> About one year ago, a high-level, objected-oriented SIMD API was added
> to Mono. For example, there is a class Vector4f for vectors of 4
> floats and this class implements methods such as basic operators,
> bitwise operators, comparison operators, min, max, sqrt, shuffle
> directly using SIMD operations.
I think you are confusing SIMD with Intel's MMX/SSE instruction set.
SIMD means "single instruction - multiple data". NumPy is interherently
an object-oriented SIMD API:
array1[:] = array2 + array3
is a SIMD instruction by definition.
SIMD instructions in hardware for length-4 vectors are mostly useful for
3D graphics. But they are not used a lot for that purpose, because GPUs
are getting common. SSE is mostly for rendering 3D graphics without a
GPU. There is nothing that prevents NumPy from having a Vector4f dtype,
that internally stores four float32 and is aligned at 16 byte
boundaries. But it would not be faster than the current float32 dtype.
Do you know why?
The reason is that memory access is slow, and computation is fast.
Modern CPUs are starved. The speed of NumPy is not limited by not using
MMX/SSE whenever possible. It is limited from having to create and
delete temporary arrays all the time. You are suggesting to optimize in
the wrong place. There is a lot that can be done to speed up
computation: There are optimized BLAS libraries like ATLAS and MKL.
NumPy uses BLAS for things like matrix multiplication. There are OpenMP
for better performance on multicores. There are OpenCL and CUDA for
moving computation from CPUs to GPU. But the main boost you get from
going from NumPy to hand-written C or Fortran comes from reduced memory use.
> existing discussion here. Memory-alignment is an import related issue
> since non-aligned movs can tank the performance.
>
>
You can align an ndarray on 16-byte boundary like this:
def aligned_array(N, dtype):
d = dtype()
tmp = numpy.zeros(N * d.nbytes + 16, dtype=numpy.uint8)
address = tmp.__array_interface__['data'][0]
offset = (16 - address % 16) % 16
return tmp[offset:offset+N].view(dtype=dtype)
Sturla Molden
More information about the NumPy-Discussion
mailing list