[Numpy-discussion] Fwd: GPU Numpy

Francesc Alted faltet@pytables....
Wed Sep 9 04:18:48 CDT 2009

On Tuesday 08 September 2009 21:19:05 George Dahl wrote:
> Sturla Molden <sturla <at> molden.no> writes:
> > Erik Tollerud wrote:
> > >> NumPy arrays on the GPU memory is an easy task. But then I would have
> > >> to write the computation in OpenCL's dialect of C99?
> > >
> > > This is true to some extent, but also probably difficult to do given
> > > the fact that parallelizable algorithms are generally more difficult
> > > to formulate in straightforward ways.
> >
> > Then you have misunderstood me completely. Creating an ndarray that has
> > a buffer in graphics memory is not too difficult, given that graphics
> > memory can be memory mapped. This has nothing to do with parallelizable
> > algorithms or not. It is just memory management. We could make an
> > ndarray subclass that quickly puts its content in a buffer accessible to
> > the GPU. That is not difficult. But then comes the question of what you
> > do with it.
> >
> > I think many here misunderstand the issue:
> >
> > Teraflops peak performance of modern GPUs is impressive. But NumPy
> > cannot easily benefit from that. In fact, there is little or nothing to
> > gain from optimising at that end. In order for a GPU to help,
> > computation must be the time-limiting factor. It is not. There is nothing
> > more to say about using GPUs in NumPy right now.
> >
> > Take a look at the timings here: http://www.scipy.org/PerformancePython
> > It shows that computing with NumPy is more than ten times slower than
> > using plain C. This is despite NumPy being written in C. The NumPy code
> > does not incur 10 times more floating point operations than the C code.
> > The floating point unit does not run in turtle mode when using NumPy.
> > NumPy's relative slowness compared to C has nothing to do with floating
> > point computation. It is due to inferior memory use (temporary buffers,
> > multiple buffer traversals) and memory access being slow. Moving
> > computation to the GPU can only make this worse.
> >
> > Improved memory usage - e.g. through lazy evaluation and JIT compilation
> > of expressions - can give up to a tenfold increase in performance. That
> > is where we must start optimising to get a faster NumPy. Incidentally,
> > this will also make it easier to leverage modern GPUs.
> >
> > Sturla Molden
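
Sturla's point about temporary buffers and multiple buffer traversals can be
illustrated with a minimal sketch. The function names axpy_naive and
axpy_inplace are made up for this example, and it only shows where the extra
buffers come from; it is not a benchmark and not the lazy-evaluation/JIT
approach he proposes.

```python
import numpy as np


def axpy_naive(a, x, y):
    # a * x allocates a temporary array, and "+ y" allocates the result:
    # two full passes over memory and two freshly allocated buffers.
    return a * x + y


def axpy_inplace(a, x, y, out):
    # Same arithmetic, but every intermediate is written into a
    # caller-supplied buffer: still two passes, but no new allocations.
    np.multiply(x, a, out=out)
    np.add(out, y, out=out)
    return out


x = np.arange(5, dtype=np.float64)
y = np.ones(5)
out = np.empty_like(x)

naive_result = axpy_naive(2.0, x, y)
inplace_result = axpy_inplace(2.0, x, y, out)
```

Both versions do exactly the same floating point work; the difference is only
in how much memory is allocated and traversed.
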
> I know that for my work, I can get a speedup on the order of 50-fold
> over numpy using a python wrapper for a simple GPU matrix class.  So I
> might be dealing with a lot of matrix products where I multiply a fixed 512
> by 784 matrix by a 784 by 256 matrix that changes between each matrix
> product, although to really see the largest gains I use a 4096 by 2048
> matrix times a bunch of 2048 by 256 matrices.  If all I was doing were
> those matrix products, it would be even faster, but what I actually am
> doing is a matrix product, then adding a column vector to the result, then
> applying an elementwise logistic sigmoid function and potentially
> generating a matrix of pseudorandom numbers the same shape as my result
> (although not always).  When I do these sorts of workloads, my python
> numpy+GPU matrix class goes so much faster than anything that doesn't use
> the GPU (be it Matlab, or numpy, or C/C++ whatever) that I don't even
> bother measuring the speedups precisely.  In some cases, my python code
> isn't making too many temporaries since what it is doing is so simple, but
> in other cases that is obviously slowing it down a bit.  I have relatively
> complicated jobs that used to take weeks on the CPU but now take hours or
> days.
> Obviously improved memory usage would be more helpful since not everyone
> has access to the sorts of GPUs I use, but tenfold increases in performance
> seem like chump change compared to what I see with the sorts of workloads I
> do.

50-fold increases over NumPy+[Atlas|MKL] are really impressive.  However, the
point is that these speed-ups can be achieved only when the ratio of
operations per element is really large.  Matrix-matrix multiplication (your
example above) is a paradigmatic example of these scenarios: for N x N
matrices, computations are O(N^3) (or a little less than N^3 when optimized
algorithms are used), while memory access is O(N^2).  Of course, when the
matrices are large, the operations/elements ratio is larger, allowing much
better speed-ups; this is why GPUs really do a good job here.
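
To put rough numbers on that ratio, here is a minimal sketch.  It assumes the
textbook algorithm (one multiply and one add per inner-product term) and that
every input and output element is touched exactly once; the function names are
made up for illustration, and real hardware with caches will behave
differently.

```python
def matmul_intensity(m, k, n):
    """Floating point operations per element touched for an (m x k) @ (k x n) product."""
    flops = 2 * m * k * n                 # one multiply + one add per term
    elements = m * k + k * n + m * n      # read A, read B, write C
    return flops / elements


def elementwise_add_intensity(n):
    """Operations per element touched for c = a + b on n-element arrays."""
    flops = n                             # one add per output element
    elements = 3 * n                      # read a, read b, write c
    return flops / elements


# The two shapes George mentions, plus a plain elementwise add for contrast.
small = matmul_intensity(512, 784, 256)
large = matmul_intensity(4096, 2048, 256)
add = elementwise_add_intensity(512 * 256)
```

Under these assumptions both matrix products perform hundreds of operations
per element touched, while the elementwise add stays at one third of an
operation per element no matter how large the arrays are.
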

The point here is that matrix-matrix multiplications (or, in general, 
functions with a large operation/element ratio) are a *tiny* part of all the 
possible operations between arrays that NumPy supports.  This is why Sturla is 
saying that it is not a good idea to include support for GPUs in all parts of 
NumPy.  A much better strategy is to give NumPy the possibility to link with 
external packages (à la BLAS, LAPACK, Atlas, MKL) that can leverage the 
powerful GPUs for specific problems (e.g. matrix-matrix multiplications).
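
A minimal sketch of what such a hook could look like, just to make the idea
concrete.  Everything here (the registry, register_matmul_backend, matmul,
fake_gpu_matmul) is hypothetical and not an existing NumPy API; the "gpu"
backend is only a stand-in that calls np.dot, since a real one would depend on
whatever GPU library is linked in.

```python
import numpy as np

# Hypothetical registry mapping a backend name to a matrix-product function.
# None of these names are part of NumPy; they are for illustration only.
# np.dot typically uses whatever BLAS NumPy was built against.
_matmul_backends = {"blas": np.dot}


def register_matmul_backend(name, func):
    """Register an external matrix-product implementation under `name`."""
    _matmul_backends[name] = func


def matmul(a, b, backend="blas"):
    """Multiply a and b with the chosen backend; all other code stays plain NumPy."""
    if backend not in _matmul_backends:
        raise ValueError("unknown backend: %r" % (backend,))
    return _matmul_backends[backend](a, b)


def fake_gpu_matmul(a, b):
    # Stand-in for a real GPU library call: a real backend would copy a and b
    # to the device, multiply there, and copy the result back to the host.
    return np.dot(a, b)


register_matmul_backend("gpu", fake_gpu_matmul)

a = np.arange(6, dtype=np.float64).reshape(2, 3)
b = np.ones((3, 4))
c = matmul(a, b, backend="gpu")
```

Only the operation with a large operation/element ratio is routed to the
external package; the rest of the array code is left untouched.
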

Francesc Alted
