[Numpy-discussion] Fwd: GPU Numpy
Sturla Molden
sturla@molden...
Thu Aug 6 15:57:50 CDT 2009
> Now linear algebra or FFTs on a GPU would probably be a huge boon,
> I'll admit - especially if it's in the form of a drop-in replacement
> for the numpy or scipy versions.
NumPy generates temporary arrays for expressions involving ndarrays. This
extra allocation and copying often takes more time than the computation
itself. With GPGPUs, we also have to bus the data to and from VRAM. D.
Knuth quoted Hoare as saying that "premature optimization is the root of
all evil." Optimizing computation when the bottleneck is memory is premature.
In order to improve on this, I think we have to add "lazy evaluation" to
NumPy. That is, an operator should not return a temporary array but a
symbolic expression. So if we have an expression like
y = a*x + b
it should not evaluate a*x into a temporary array. Rather, the operators
would build up a "parse tree" like
y = add(multiply(a,x),b)
and evaluate the whole expression later on.
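The operator-overloading scheme above can be sketched roughly as follows. The Expr/Leaf/Node classes are hypothetical, not NumPy API; the point is only that __mul__ and __add__ return tree nodes instead of arrays:

```python
import numpy as np

class Expr:
    # Operators build tree nodes instead of computing temporaries.
    def __add__(self, other):
        return Node(np.add, self, other)
    def __mul__(self, other):
        return Node(np.multiply, self, other)

class Leaf(Expr):
    def __init__(self, array):
        self.array = array
    def eval(self):
        return self.array

class Node(Expr):
    def __init__(self, ufunc, left, right):
        self.ufunc, self.left, self.right = ufunc, left, right
    def eval(self):
        # Nothing is computed until eval() walks the whole tree.
        return self.ufunc(self.left.eval(), self.right.eval())

a = Leaf(np.array([2.0]))
x = Leaf(np.array([3.0]))
b = Leaf(np.array([1.0]))
y = a * x + b        # builds add(multiply(a, x), b); no computation yet
result = y.eval()    # evaluates the whole expression in one go
```

A code generator would walk the same tree and emit a single kernel instead of calling the ufuncs one by one.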
This would require two things: First we need "dynamic code generation",
which incidentally is what OpenCL is all about. I.e. OpenCL is a
dynamically invoked compiler; there is a function
clCreateProgramWithSource, which does just what it says. Second, we
need arrays to be immutable. This is very important. If arrays are not
immutable, code like this could fail:
y = a*x + b
x[0] = 1235512371235
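To see the failure mode concretely, here is a small demonstration using a lambda as a stand-in for a lazy expression. Mutating x between building the expression and evaluating it silently changes the result:

```python
import numpy as np

a = np.array([2.0])
x = np.array([3.0])
b = np.array([1.0])

eager = a * x + b                # computed immediately: [7.0]
deferred = lambda: a * x + b     # stands in for a lazy expression

x[0] = 5.0                       # mutate an input before evaluation
late = deferred()                # yields [11.0], not the [7.0] a user expected
```

With immutable arrays, the assignment to x[0] would raise instead of corrupting the deferred result.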
With lazy evaluation, the memory overhead would be much smaller. The
GPGPU would also get more complex expressions to use as kernels.
There should be an option of running this on the CPU, possibly using
OpenMP for multi-threading. We could either depend on a compiler (C or
Fortran) being installed, or use opcodes for a dedicated virtual machine
(cf. what numexpr does).
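The virtual-machine option can be sketched with a toy stack machine in the spirit of numexpr (the opcodes and the run() function here are hypothetical, not numexpr's actual format). It evaluates the whole expression block by block, so temporaries are only a block long instead of full-array copies:

```python
import numpy as np

# Opcodes for a toy stack machine.
LOAD, ADD, MUL = "load", "add", "mul"

def run(program, inputs, blocksize=4096):
    """Evaluate a postfix program block by block; temporaries are
    at most blocksize elements, never full-size arrays."""
    n = len(next(iter(inputs.values())))
    out = np.empty(n)
    for start in range(0, n, blocksize):
        sl = slice(start, min(start + blocksize, n))
        stack = []
        for op, arg in program:
            if op == LOAD:
                stack.append(inputs[arg][sl])
            elif op == ADD:
                rhs = stack.pop()
                stack.append(stack.pop() + rhs)
            elif op == MUL:
                rhs = stack.pop()
                stack.append(stack.pop() * rhs)
        out[sl] = stack.pop()
    return out

# y = a*x + b in postfix: load a, load x, mul, load b, add
prog = [(LOAD, "a"), (LOAD, "x"), (MUL, None), (LOAD, "b"), (ADD, None)]
data = {"a": np.full(10000, 2.0), "x": np.full(10000, 3.0), "b": np.ones(10000)}
y = run(prog, data)
```

A real VM would dispatch opcodes in C, but the blocking strategy is the same one numexpr uses to keep temporaries in cache.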
In order to limit the scope of immutability, we could introduce a
context manager. Inside the with statement, all arrays would be
immutable. The __exit__ method could then trigger the code generator
and do all the evaluation. So we would get something like this:
# normal numpy here
with numpy.accelerator():
    # arrays become immutable
    # lazy evaluation
    # code generation and evaluation on exit
# normal numpy continues here
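There is no numpy.accelerator today; a minimal sketch of such a context manager, using ndarray.setflags to enforce immutability on the arrays it is given, might look like this:

```python
import numpy as np

class accelerator:
    """Hypothetical context manager: arrays passed in become
    read-only inside the block and writable again on exit,
    where a real implementation would also run the generated code."""
    def __init__(self, *arrays):
        self.arrays = arrays
    def __enter__(self):
        for arr in self.arrays:
            arr.setflags(write=False)   # immutable inside the block
        return self
    def __exit__(self, exc_type, exc, tb):
        # Code generation and evaluation would be triggered here.
        for arr in self.arrays:
            arr.setflags(write=True)
        return False

x = np.zeros(3)
with accelerator(x):
    try:
        x[0] = 1.0                      # raises: array is read-only
    except ValueError:
        pass
x[0] = 1.0                              # writable again after exit
```

A full implementation would have to track every array touched inside the block, not just those passed in explicitly.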
Thus, here is my plan:
1. a special context-manager class
2. immutable arrays inside with statement
3. lazy evaluation: expressions build up a parse tree
4. dynamic code generation
5. evaluation on exit
I guess it is possible to find ways to speed this up as well. If a
context manager always generates the same OpenCL code, the body of the
with statement would only need to execute once (we could raise an
exception on enter to jump directly to exit).
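The generate-once idea can be illustrated by caching compiled kernels keyed by their source text. Here Python's compile() stands in for an OpenCL build via clCreateProgramWithSource; the function and cache names are illustrative only:

```python
import numpy as np

_kernel_cache = {}

def get_kernel(expr_src):
    """Return a compiled kernel for expr_src, compiling at most once
    per distinct expression (compile() stands in for an OpenCL build)."""
    fn = _kernel_cache.get(expr_src)
    if fn is None:
        code = compile("lambda a, x, b: " + expr_src, "<generated>", "eval")
        fn = _kernel_cache[expr_src] = eval(code)
    return fn

k1 = get_kernel("a * x + b")
k2 = get_kernel("a * x + b")   # cache hit: same compiled object
y = k1(np.full(4, 2.0), np.full(4, 3.0), np.ones(4))
```

With a cache like this, the expensive compilation happens on the first pass through the with statement and is amortized over later iterations.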
It is possible to create a superfast NumPy. But just plugging GPGPUs
into the current design would be premature. In NumPy's current state,
with mutable ndarrays and operators generating temporary arrays, there
is not much to gain from introducing GPGPUs. It would only be beneficial
in computationally demanding parts like FFTs and solvers for linear
algebra and differential equations. Ufuncs with transcendental functions
might also benefit. SciPy would certainly benefit more from GPGPUs than
NumPy.
Just my five cents :-)
Regards,
Sturla Molden