[Numpy-discussion] Fwd: GPU Numpy

Sturla Molden sturla@molden...
Thu Aug 6 15:57:50 CDT 2009

> Now linear algebra or FFTs on a GPU would probably be a huge boon, 
> I'll admit - especially if it's in the form of a drop-in replacement 
> for the numpy or scipy versions.

NumPy generate temporary arrays for expressions involving ndarrays. This 
extra allocation and copying often takes more time than the computation. 
With GPGPUs, we have to bus the data to and from VRAM as well. D. Knuth 
quoted Hoare saying that "premature optimization is the root of all 
evil." Optimizing computation when the bottleneck is memory is premature.

In order to improve on this, I think we have to add "lazy evaluation" to 
NumPy. That is, an operator should not return a temporary array but a 
symbolic expression. So if we have an expression like

    y = a*x + b

it should not evalute a*x into a temporary array. Rather, the operators 
would build up a "parse tree" like

    y = add(multiply(a,x),b)

and evalute the whole expression  later on.

This would require two things: First we need "dynamic code generation", 
which incidentally is what OpenCL is all about. I.e. OpenCL is 
dynamically invoked compiler; there is a function 
clCreateProgramFromSource, which  does just what it says. Second, we 
need arrays to be immutable. This is very important. If arrays are not 
immutable, code like this could fail:

    y = a*x + b
    x[0] = 1235512371235

With lazy evaluation, the memory overhead would be much smaller. The 
GPGPU would also get a more complex expressions to use as a kernels.

There should be an option of running this on the CPU, possibly using 
OpenMP for multi-threading. We could either depend on a compiler (C or 
Fortran) being installed, or use opcodes for a dedicated virtual machine 
(cf. what numexpr does).

In order to reduce the effect of immutable arrays, we could introduce a 
context-manager. Inside the with statement, all arrays would be 
immutable. Second, the __exit__ method could trigger the code generator 
and do all the evaluation. So we would get something like this:

    # normal numpy here

    with numpy.accelerator():

        # arrays become immutable
        # lazy evaluation
        # code generation and evaluation on exit

    # normal numpy continues here

Thus, here is my plan:

1. a special context-manager class
2. immutable arrays inside with statement
3. lazy evaluation: expressions build up a parse tree
4. dynamic code generation
5. evaluation on exit

I guess it is possibly to find ways to speed up this as well. If a 
context manager would always generate the same OpenCL code, the with 
statement would only need to execute once (we could raise an exception 
on enter to jump directly to exit).

It is possibly to create a superfast NumPy. But just plugging GPGPUs 
into the current design would be premature. In NumPy's current state, 
with mutable ndarrays and operators generating temporary arrays, there 
is not much to gain from introducing GPGPUs. It would only be beneficial 
in computationally demanding parts like FFTs and solvers for linear 
algebra and differential equations. Ufuncs with trancendental functions 
might also benefit. SciPy would certainly benefit more from GPGPUs than 

Just my five cents :-)

Sturla Molden

More information about the NumPy-Discussion mailing list