[Numpy-discussion] NumPy speed tests by NASA
Sturla Molden
sturla@molden...
Tue Feb 22 18:21:04 CST 2011
On 23.02.2011 00:19, Gökhan Sever wrote:
>
> I am guessing ATLAS is thread aware since with N=15000 each of the
> four cores runs at 100%. Probably an MKL build doesn't bring much speed
> advantage in this computation. Any thoughts?
>
There are still things like optimal cache use, SIMD extensions, etc. to
consider. Some of MKL is hand-tweaked assembly and is, e.g., very fast on iCore.
Other BLAS implementations to consider are ACML, GotoBLAS2, ACML-GPU,
and CUBLAS.
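A rough way to compare such builds is to check which BLAS a given NumPy is linked against and time a large matrix product. The following is only a minimal sketch: the helper name time_matmul, the matrix sizes, and the repeat count are illustrative assumptions, not part of the original benchmark.

```python
import time
import numpy as np


def time_matmul(n, repeats=3):
    """Return (best_seconds, gflops) for an n x n double-precision product.

    Hypothetical helper for illustration; not from the original tests.
    """
    rng = np.random.default_rng(0)
    a = rng.standard_normal((n, n))
    b = rng.standard_normal((n, n))
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        np.dot(a, b)  # uses the BLAS matrix product when NumPy is linked against one
        best = min(best, time.perf_counter() - t0)
    # A dense n x n product costs about 2*n**3 floating-point operations.
    return best, 2.0 * n**3 / best / 1e9


if __name__ == "__main__":
    np.show_config()  # prints the BLAS/LAPACK libraries this NumPy build uses
    for n in (500, 1000, 2000):
        seconds, gflops = time_matmul(n)
        print(f"N={n}: {seconds:.4f} s, about {gflops:.1f} GFLOP/s")
```

Running the same script under NumPy builds linked against different BLAS libraries gives a like-for-like comparison on one machine; watching CPU usage while it runs shows whether the BLAS is multithreaded.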
GotoBLAS2 is currently the fastest BLAS implementation on x64 CPUs. It
can e.g. be linked with the reference implementation of LAPACK. GotoBLAS
is open source and is very easy to build ("just type make").
ACML is probably better than MKL on AMD processors, though not as good as
MKL on Intel processors. It is currently free of charge (an MKL license
costs $399).
The recent ACML-GPU library can move matrix multiplication (DGEMM and
friends) to the GPU if there is an ATI (AMD) chip available, and the
matrices are sufficiently large. The ATI GPU can also be programmed with
OpenCL, but ACML-GPU just looks like an ordinary BLAS and LAPACK
implementation (in addition to FFTs and PRNGs), so no special
programming is needed.
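One common explanation for the "sufficiently large" condition (an assumption here, not something ACML-GPU documents in this post) is that a matrix product does about 2*N**3 floating-point operations on only about 3*N**2 matrix elements that must cross the bus, so the work per transferred byte grows with N. A small sketch of that arithmetic, with the hypothetical helper name dgemm_intensity:

```python
def dgemm_intensity(n, bytes_per_element=8):
    """Floating-point operations per byte moved for an n x n DGEMM.

    Assumes A and B are copied to the GPU and C is copied back,
    i.e. three n x n double-precision matrices cross the bus.
    """
    flops = 2 * n**3                             # multiply-adds in C = A * B
    bytes_moved = 3 * n * n * bytes_per_element  # A, B in; C out
    return flops / bytes_moved                   # simplifies to n / 12 for doubles


if __name__ == "__main__":
    for n in (12, 120, 1200, 12000):
        print(f"N={n}: {dgemm_intensity(n):.1f} flops per byte transferred")
```

The actual crossover size depends on the GPU, the bus, and the CPU BLAS it competes with, so it has to be measured rather than derived from this ratio.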
If one has an nVidia GPU, there is the CUBLAS library which implements
BLAS, but not LAPACK. It has Fortran bindings and can probably be used
with a reference LAPACK.
Sturla