[Numpy-discussion] Fwd: GPU Numpy

Sturla Molden sturla@molden...
Wed Sep 9 23:47:47 CDT 2009


James Bergstra skrev:
> Suppose you want to evaluate "dot(a*b+c*sqrt(d), e)".  The GPU is
> great for doing dot(), 
The CPU is equally great (or better?) for doing dot(). In both cases:

- memory access scales as O(n) for dot products.
- computation scales as O(n) for dot products.
- memory access is slow
- computation is fast (faster on the GPU)

In both cases, the floating point unit is starved. That means it could 
do a lot more work if memory were faster.
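
As a rough sketch of that starvation argument (my numbers, not from the
post): a length-n double-precision dot product does about 2n flops while
reading about 2*8n bytes of operands, so its arithmetic intensity is a
constant, well below what either a CPU or a GPU needs to keep its
floating point units busy:

```python
# Arithmetic intensity of a length-n dot product (double precision).
# Per element: one multiply + one add = 2 flops; two 8-byte reads.
n = 1_000_000
flops = 2 * n
bytes_moved = 2 * 8 * n
intensity = flops / bytes_moved
print(intensity)  # 0.125 flop/byte, independent of n
```

Modern processors typically need on the order of ten flops per byte of
memory traffic to run near peak, so at 0.125 the FPU mostly waits.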

For the GPU to be "faster than the CPU", you have to have a situation where 
computation dominates over memory access. Matrix-matrix multiplication 
is one such example: it does O(n**3) flops on only O(n**2) data. This is 
what GPUs are designed to do, as it is the major bottleneck in 3D graphics.
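
A back-of-the-envelope sketch of why matrix-matrix multiplication is
compute-bound (ignoring caches and blocking; the helper below is my
illustration):

```python
def intensity_matmul(n):
    """Flops per byte for an n x n double-precision matrix multiply."""
    flops = 2 * n ** 3            # n^2 output entries, ~2n flops each
    bytes_moved = 3 * n ** 2 * 8  # read A and B, write C; 8 bytes/entry
    return flops / bytes_moved

print(intensity_matmul(1000))  # ~83 flops/byte: compute-bound
```

Unlike the dot product's constant 0.125 flop/byte, this ratio grows
linearly with n, which is exactly the regime GPUs are built for.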

The proper way to speed up "dot(a*b+c*sqrt(d), e)" is to get rid of 
temporary intermediates. That is, in Python pseudo-code:

result = 0
for i in range(n):
    result += (a[i]*b[i] + c[i]*sqrt(d[i])) * e[i]

instead of:

tmp0 = empty(n)
for i in range(n):
    tmp0[i] = a[i] * b[i]

tmp1 = empty(n)
for i in range(n):
    tmp1[i] = sqrt(d[i])

tmp2 = empty(n)
for i in range(n):
    tmp2[i] = c[i] * tmp1[i]

tmp3 = empty(n)
for i in range(n):
    tmp3[i] = tmp0[i] + tmp2[i]

result = 0
for i in range(n):
    result += tmp3[i] * e[i]
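
In NumPy itself one can at least cut the temporaries down by hand with
in-place operations; a sketch of the same expression with two
temporaries instead of four (this rewrite is my illustration, not from
the original post):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, c, d, e = (rng.random(1000) for _ in range(5))

# Naive version: every intermediate allocates a fresh array.
naive = np.dot(a * b + c * np.sqrt(d), e)

# In-place version: reuse one buffer where possible.
tmp = np.sqrt(d)   # temporary 1
tmp *= c           # in place: c[i] * sqrt(d[i])
tmp += a * b       # temporary 2 for a*b, then in-place add
fused = np.dot(tmp, e)

assert np.allclose(naive, fused)
```

Tools such as numexpr go further and compile the whole elementwise
expression into a single blocked loop, avoiding the intermediates
entirely.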


It is these temporary intermediates that make NumPy an order of magnitude 
slower than hand-crafted C (but still much faster than pure Python!). 
Adding in GPUs will not change this. The amount of computation (the flop 
count) is the same, so it cannot be the source of the slowness.


Sturla Molden




