# [Numpy-discussion] Fwd: GPU Numpy

Sturla Molden sturla@molden...
Wed Sep 9 23:47:47 CDT 2009

```
James Bergstra wrote:
> Suppose you want to evaluate "dot(a*b+c*sqrt(d), e)".  The GPU is
> great for doing dot(),
The CPU is equally great (or better?) for doing dot(). In both cases:

- memory access scales as O(n) for dot products.
- computation scales as O(n) for dot products.
- memory access is slow
- computation is fast (faster on the GPU)

In both cases, the floating point unit is starved. That means it could
do a lot more work if memory were faster.

For the GPU to be "faster than the CPU", you need a situation where
computation dominates memory access. Matrix-matrix multiplication
is one such example. This is what GPUs are designed for, as it is the
major bottleneck in 3D graphics.

The proper way to speed up "dot(a*b+c*sqrt(d), e)" is to get rid of
the temporary intermediates, i.e. to compute everything in a single
fused pass. In Python pseudo-code:

result = 0
for i in range(n):
    result += (a[i]*b[i] + c[i]*sqrt(d[i])) * e[i]

instead of what NumPy actually does, which is one full pass over memory
(and one temporary array) per operation:

tmp0 = empty(n)
for i in range(n):
    tmp0[i] = a[i] * b[i]

tmp1 = empty(n)
for i in range(n):
    tmp1[i] = sqrt(d[i])

tmp2 = empty(n)
for i in range(n):
    tmp2[i] = c[i] * tmp1[i]

tmp3 = empty(n)
for i in range(n):
    tmp3[i] = tmp0[i] + tmp2[i]

result = 0
for i in range(n):
    result += tmp3[i] * e[i]

It is this extra memory traffic from temporaries that makes NumPy an
order of magnitude slower than hand-crafted C (though still much faster
than pure Python!). Adding GPUs will not change this: the amount of
computation (the flop count) is the same, so it cannot be the source of
the slowness.

Sturla Molden

```
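Sturla's point that dot products starve the floating-point unit while matrix-matrix multiplication does not can be made concrete with a back-of-the-envelope arithmetic-intensity calculation (flops per byte moved). This is a sketch, not part of the original email: the function names are mine, and the byte counts assume float64 operands, each read from or written to memory exactly once (i.e. no caching).

```python
# Rough arithmetic intensity (flops per byte of memory traffic),
# assuming float64 (8 bytes) and every operand touched exactly once.

def dot_intensity(n):
    # dot(x, y): n multiplies + (n - 1) adds, over 2n loads of 8 bytes each.
    flops = 2 * n - 1
    bytes_moved = 2 * n * 8
    return flops / bytes_moved

def matmul_intensity(n):
    # n x n matrix multiply: 2*n^3 flops, over reading two n^2 matrices
    # and writing one n^2 result.
    flops = 2 * n**3
    bytes_moved = 3 * n**2 * 8
    return flops / bytes_moved

print(dot_intensity(10**6))    # stays below 1/8 flop per byte, whatever n is
print(matmul_intensity(1000))  # grows linearly with n: compute can dominate
```

The dot product's intensity is bounded by a constant, so making the arithmetic units faster (CPU or GPU) cannot help once memory bandwidth is saturated; matmul's intensity grows with n, which is why it suits GPUs.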
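The temporary-elimination argument in the email can be checked with a small runnable sketch. Here plain Python lists and `math.sqrt` stand in for NumPy arrays (the function names `staged` and `fused` are mine, not from the email); both versions compute `dot(a*b + c*sqrt(d), e)`, one with the intermediate arrays NumPy would allocate and one in a single fused pass.

```python
import math

def staged(a, b, c, d, e):
    # What NumPy effectively does: one full pass (and one temporary) per op.
    n = len(a)
    tmp0 = [a[i] * b[i] for i in range(n)]
    tmp1 = [math.sqrt(d[i]) for i in range(n)]
    tmp2 = [c[i] * tmp1[i] for i in range(n)]
    tmp3 = [tmp0[i] + tmp2[i] for i in range(n)]
    return sum(tmp3[i] * e[i] for i in range(n))

def fused(a, b, c, d, e):
    # Single pass, no temporaries: each element is loaded once.
    result = 0.0
    for i in range(len(a)):
        result += (a[i] * b[i] + c[i] * math.sqrt(d[i])) * e[i]
    return result

a, b = [1.0, 2.0], [3.0, 4.0]
c, d, e = [5.0, 6.0], [4.0, 9.0], [7.0, 8.0]
print(staged(a, b, c, d, e))  # same answer either way;
print(fused(a, b, c, d, e))   # the fused version just moves far less memory
```

The results are identical; the difference is only in memory traffic, which is exactly why fusing loops (not adding a GPU) is the fix the email argues for.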