[Numpy-discussion] Accelerating NumPy computations [Was: GPU Numpy]

Frédéric Bastien nouiz@nouiz....
Fri Aug 21 09:01:26 CDT 2009


On Thu, Aug 20, 2009 at 7:07 AM, Francesc Alted <faltet@pytables.org> wrote:

> On Thu, 20 Aug 2009 at 00:37 -0700, Erik Tollerud wrote:
> > > NumPy arrays on the GPU memory is an easy task. But then I would have to
> > > write the computation in OpenCL's dialect of C99? But I'd rather program
> > > everything in Python if I could. Details like GPU and OpenCL should be
> > > hidden away. Nice looking Python with NumPy is much easier to read and
> > > write. That is why I'd like to see a code generator (i.e. JIT compiler)
> > > for NumPy.
> >
> > This is true to some extent, but also probably difficult to do given
> > the fact that parallelizable algorithms are generally more difficult
> > to formulate in straightforward ways.  In the intermediate term, I
> > think there is value in having NumPy implement some sort of interface
> > to OpenCL or CUDA - I can easily see an explosion of different
> > bindings (it's already starting), and having a "canonical" way encoded
> > in NumPy or SciPy is probably the best way to mitigate the inevitable
> > compatibility problems... I'm partial to the way PyCUDA can do it
> > (basically, just export NumPy arrays to the GPU and let you write the
> > code from there), but the main point is to just get some basic
> > compatibility in pretty quickly, as I think this GPGPU is here to
> > stay...
> Maybe.  However I think that we should not forget the fact that, as
> Stula pointed out, the main bottleneck for *many* problems nowadays is
> memory access, not CPU speed.  GPUs may have faster memory, but only a
> few % better than mainstream memory.  I'd like to hear from anyone here
> who has achieved any kind of speed-up in their calculations by using GPUs
> instead of CPUs.  By looking at these scenarios we may get an idea of
> where GPUs can be useful, and whether an effort to support them in
> NumPy would be worthwhile.

I get around a 10x speed-up in convolution. I compare against my own version
on the CPU, which is 20-30x faster than the version in SciPy... I should
backport some of my optimisations (not all of them are possible, as I removed
some cases), but I haven't had the time. GPUs are most useful when the
bottleneck is the CPU, not the memory, and the problem must be highly
parallel. In that case, speed-ups of 100-200x have been reported. But take
those numbers with a grain of salt: many of the reports don't say much about
the CPU implementation. In that case, they probably compare a highly
optimized version on the GPU against an unoptimized version on the CPU. I
have seen a case where they don't say which version of BLAS they used on the
CPU for matrix multiplication. So it may be that they just forgot to mention
that they used an optimized one, or that they didn't use one. In the latter
case, the speed-up doesn't have any
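
To illustrate how much the choice of CPU baseline can change a reported
speed-up, here is a minimal sketch that times the same matrix product two
ways: a deliberately unoptimized pure-Python triple loop and `np.dot`, which
uses whatever BLAS NumPy was built against. The names `naive_matmul` and
`time_call` and the 60x60 size are made up for this example, and the ratio it
prints depends entirely on the machine and the BLAS build.

```python
import time
import numpy as np

def naive_matmul(a, b):
    """Deliberately unoptimized triple-loop matrix product (hypothetical baseline)."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2, "inner dimensions must match"
    out = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i, p] * b[p, j]
            out[i, j] = s
    return out

def time_call(fn, *args):
    """Return (result, elapsed seconds) for a single call; a crude timer."""
    t0 = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - t0

rng = np.random.default_rng(0)
a = rng.random((60, 60))
b = rng.random((60, 60))

slow, t_naive = time_call(naive_matmul, a, b)
fast, t_blas = time_call(np.dot, a, b)

# Both versions compute the same product; only the implementation differs.
same_result = np.allclose(slow, fast)
print("results agree:", same_result)
# Guard against a zero elapsed time on very coarse clocks.
print("naive / np.dot time ratio: %.1f" % (t_naive / max(t_blas, 1e-9)))
```

A GPU result compared against something like `naive_matmul` would look far
better than the same result compared against an optimized BLAS, which is why
the baseline has to be stated. `np.show_config()` reports which BLAS/LAPACK
libraries a given NumPy build is linked against.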

> I personally think that, in general, exposing GPU capabilities directly
> to NumPy would provide little service for most NumPy users.  I would
> rather leave this task to specialized libraries (like PyCUDA, or special
> versions of ATLAS, for example) that can be used from NumPy.

Specialized libraries can be a good start, as currently there is too much
uncertainty about the language (OpenCL vs. the NVIDIA driver API (PyCUDA, but
not CUBLAS, CUFFT, ...) vs. C-CUDA (CUBLAS, CUFFT)).

One thing that could help all those specialized libraries (I am writing one
with James B., cuda_ndarray) is to have a standardized version of ndarray for
the GPU. But I'm not sure now is a good time to do it.

> Until then, I think that a more direct approach (and one that would
> deliver results earlier) for speeding up NumPy is to be aware of the
> hierarchical nature of the different memory levels in current CPUs and
> make NumPy play nicely with it.  In that sense, I think that applying
> the blocking technique (see [1] for a brief explanation) to take
> advantage of both spatial and temporal locality is the way to go.  For
> example, most of the speed-up that Numexpr achieves comes from the
> fact that it uses blocking during the evaluation of complex expressions.
> This is so because the temporaries are kept small and can fit in current
> CPU caches.  Implementing similar algorithms in NumPy should not be that
> difficult, especially now that the Numexpr implementation already
> exists as a model.
>
> And another thing that may further help to fight memory slowness (or
> CPU/GPU quickness, as you prefer ;-) in the near future is compression.
> Compression has already helped bring data faster from disk to CPU over
> the last 10 years, and now it is almost time for this to happen with
> memory too, not only disk.
>
> In [1] I demonstrated that compression can *already* help transmitting
> data in memory to CPU.  Agreed, right now this is only true for highly
> compressible data (which is an important corner case anyway), but in the
> near future we will see how the compression technique will be able to
> accelerate computations for a wide variety of datasets, even if they are
> not very compressible.
>
> So, in my humble opinion, implementing the possibility that NumPy can
> deal with compressed buffers in addition to uncompressed ones could be
> very interesting in the near future (or even now, in specific
> situations).
>
> [1] http://www.pytables.org/docs/StarvingCPUs.pdf
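
As a rough illustration of the "highly compressible data" case mentioned in
the quoted text, the sketch below compresses a mostly-zero NumPy buffer and
restores it. It uses Python's standard `zlib` module purely as a stand-in
compressor (an assumption made for this example, not the method used in [1]),
and it only shows that the buffer shrinks and round-trips correctly. It says
nothing about whether computations get faster.

```python
import zlib
import numpy as np

# Highly compressible data: one million float64 values, almost all zero.
data = np.zeros(1_000_000, dtype=np.float64)
data[::1000] = 1.0

raw = data.tobytes()
# Level 1 favours speed over compression ratio.
packed = zlib.compress(raw, 1)

# Decompress and reinterpret the bytes as the original dtype.
restored = np.frombuffer(zlib.decompress(packed), dtype=data.dtype)

ratio = len(raw) / len(packed)
print("uncompressed bytes:", len(raw))
print("compressed bytes:  ", len(packed))
print("round trip exact:  ", np.array_equal(restored, data))
```

Fewer bytes have to cross the memory bus for data like this, which is the
effect the quoted argument relies on. Whether that outweighs the
decompression cost depends on the compressor and the data.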

Very interesting. An optimized NumPy on the CPU is a good thing, as not all
algorithms are well suited to the GPU, and when we develop a new algorithm,
doing it on the CPU is MUCH easier today.
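
The blocking idea from the quoted text can be sketched in plain NumPy as
follows. This is only an illustration under stated assumptions: the function
names, the expression `2*a + 3*b*c`, and the block size of 4096 elements are
all made up for the example, and this is not how Numexpr is implemented
internally. The point is that the temporaries created inside the loop are
only one block long, so they have a chance of staying in the CPU cache.

```python
import numpy as np

def eval_whole(a, b, c):
    # Plain NumPy: each operator allocates a full-size temporary array.
    return 2.0 * a + 3.0 * b * c

def eval_blocked(a, b, c, block_size=4096):
    # Same expression for 1-D arrays, evaluated one block at a time so that
    # every temporary has at most block_size elements.
    out = np.empty_like(a)
    for start in range(0, a.size, block_size):
        stop = start + block_size  # slicing past the end is safe in NumPy
        out[start:stop] = (2.0 * a[start:stop]
                           + 3.0 * b[start:stop] * c[start:stop])
    return out

rng = np.random.default_rng(0)
n = 100_003  # deliberately not a multiple of the block size
a = rng.random(n)
b = rng.random(n)
c = rng.random(n)

whole = eval_whole(a, b, c)
blocked = eval_blocked(a, b, c)
print("results agree:", np.allclose(whole, blocked))
```

Because the loop here runs in Python, this version is not guaranteed to be
faster than the plain expression. A real implementation would do the blocking
in compiled code.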

Frederic Bastien