[Numpy-discussion] Accelerating NumPy computations [Was: GPU Numpy]

Francesc Alted faltet@pytables....
Thu Aug 20 06:07:23 CDT 2009

On Thu, 20 Aug 2009 at 00:37 -0700, Erik Tollerud wrote:
> > NumPy arrays on the GPU memory is an easy task. But then I would have to
> > write the computation in OpenCL's dialect of C99? But I'd rather program
> > everything in Python if I could. Details like GPU and OpenCL should be
> > hidden away. Nice looking Python with NumPy is much easier to read and
> > write. That is why I'd like to see a code generator (i.e. JIT compiler)
> > for NumPy.
> This is true to some extent, but also probably difficult to do given
> the fact that parallelizable algorithms are generally more difficult
> to formulate in straightforward ways.  In the intermediate-term, I
> think there is value in having numpy implement some sort of interface
> to OpenCL or cuda - I can easily see an explosion of different
> bindings (it's already starting), and having a "canonical" way encoded
> in numpy or scipy is probably the best way to mitigate the inevitable
> compatibility problems... I'm partial to the way pycuda can do it
> (basically, just export numpy arrays to the GPU and let you write the
> code from there), but the main point is to just get some basic
> compatibility in pretty quickly, as I think this GPGPU is here to
> stay...

Maybe.  However, I think we should not forget that, as Sturla pointed
out, the main bottleneck for *many* problems nowadays is memory access,
not CPU speed.  GPUs may have faster memory, but only a few % better
than mainstream memory.  I'd like to hear from anyone here who has
achieved any kind of speed-up in their calculations by using GPUs
instead of CPUs.  By looking at these scenarios we may get an idea of
where GPUs can be useful, and whether an effort to support them in
NumPy would be worthwhile.

I personally think that, in general, exposing GPU capabilities directly
to NumPy would provide little service for most NumPy users.  I would
rather leave this task to specialized libraries (like PyCUDA, or special
versions of ATLAS, for example) that can be used from NumPy.

Until then, I think that a more direct approach (and one that would
deliver results earlier) for speeding up NumPy is to be aware of the
hierarchical nature of the different memory levels in current CPUs and
make NumPy play nicely with it.  In that sense, I think that applying
the blocking technique (see [1] for a brief explanation) to take
advantage of both spatial and temporal locality is the way to go.  For
example, most of the speed-up that Numexpr achieves comes from the
fact that it uses blocking during the evaluation of complex expressions.
This works because the temporaries are kept small and can fit in current
CPU caches.  Implementing similar algorithms in NumPy should not be that
difficult, especially now that the Numexpr implementation already
exists as a model.
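
As a rough illustration of the blocking idea described above, here is a
minimal sketch in plain NumPy.  It is not Numexpr's actual
implementation: the function name blocked_eval, the expression
3*a + 2*b - 1, and the block size of 4096 elements are all assumptions
chosen for the example.  The point is only that each temporary is one
block long, so it has a chance of staying in cache, instead of being as
large as the full operands.

```python
import numpy as np

def blocked_eval(a, b, block_size=4096):
    """Evaluate 3*a + 2*b - 1 over 1-D arrays, one block at a time.

    Hypothetical helper for illustration; block_size is an assumed
    value, not a tuned one.
    """
    out = np.empty_like(a)
    for start in range(0, a.size, block_size):
        stop = min(start + block_size, a.size)
        # The temporary holds at most block_size elements, so it can
        # remain in the CPU cache while the in-place updates run.
        tmp = 3 * a[start:stop]
        tmp += 2 * b[start:stop]
        tmp -= 1
        out[start:stop] = tmp
    return out

a = np.arange(1_000_000, dtype=np.float64)
b = np.ones_like(a)
result = blocked_eval(a, b)
```

The unblocked expression 3*a + 2*b - 1 gives the same values but
allocates full-size temporaries for the intermediate results.  Whether
the blocked loop is actually faster from pure Python depends on the
loop overhead; a real implementation would run the loop in compiled
code.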

Another thing that may further help to fight memory slowness (or
CPU/GPU quickness, as you prefer ;-) in the near future is compression.
Compression has already helped bring data faster from disk to CPU over
the last 10 years, and it is almost time for the same to happen with
memory too, not only disk.
In [1] I demonstrated that compression can *already* help transmit
data in memory to the CPU.  Agreed, right now this is only true for
highly compressible data (which is an important corner case anyway),
but in the near future we will see compression techniques accelerate
computations for a wide variety of datasets, even if they are not very
compressible.
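
To make the data flow concrete, here is a small sketch of computing on a
buffer that is kept compressed in memory and decompressed one chunk at a
time.  It uses zlib from the Python standard library purely as a
stand-in compressor; that choice, the helper names compress_chunks and
sum_compressed, and the chunk length are assumptions for the example and
are not the method used in [1].  It shows the mechanics only and makes
no claim about speed.

```python
import zlib
import numpy as np

def compress_chunks(arr, chunk_len=65536):
    """Split a 1-D array into chunks and compress each one separately.

    Hypothetical helper; level 1 favours speed over compression ratio.
    """
    return [zlib.compress(arr[i:i + chunk_len].tobytes(), 1)
            for i in range(0, arr.size, chunk_len)]

def sum_compressed(chunks, dtype):
    """Sum all elements, decompressing only one chunk at a time."""
    total = 0.0
    for chunk in chunks:
        # Only this small decompressed block is materialized in memory.
        block = np.frombuffer(zlib.decompress(chunk), dtype=dtype)
        total += block.sum()
    return total

# Highly compressible data: mostly zeros, with a 1.0 every 1000 elements.
data = np.zeros(1_000_000, dtype=np.float64)
data[::1000] = 1.0

chunks = compress_chunks(data)
compressed_bytes = sum(len(c) for c in chunks)
total = sum_compressed(chunks, data.dtype)
```

For data like this, compressed_bytes is far smaller than data.nbytes, so
much less memory traffic is needed to bring the operands to the CPU; the
open question is whether decompression is cheap enough to make that a
net win.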

So, in my humble opinion, giving NumPy the ability to deal with
compressed buffers in addition to uncompressed ones could be very
interesting in the near future (or even now, in specific

[1] http://www.pytables.org/docs/StarvingCPUs.pdf

