[Numpy-discussion] Fwd: GPU Numpy
Sturla Molden
sturla@molden...
Wed Sep 9 23:47:47 CDT 2009
James Bergstra wrote:
> Suppose you want to evaluate "dot(a*b+c*sqrt(d), e)". The GPU is
> great for doing dot(),
The CPU is equally great (or better?) for doing dot(). In both cases:
- memory access scales as O(n) for dot products.
- computation scales as O(n) for dot products.
- memory is slow.
- computation is fast (faster on the GPU).
In both cases, the floating point unit is starved: it could do a lot
more work if memory were faster.
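This can be made quantitative with a back-of-the-envelope sketch (my own
illustration, with a hypothetical helper name): a length-n float64 dot
product does 2n flops but must stream 16n bytes from memory, so its
flops-per-byte ratio is a small constant no matter how large n gets.

```python
def dot_intensity(n, itemsize=8):
    """Arithmetic intensity (flops per byte) of an n-element dot product."""
    flops = 2 * n                    # n multiplies + n adds
    bytes_moved = 2 * n * itemsize   # stream two float64 vectors from memory
    return flops / bytes_moved

# Constant 0.125 flops per byte, independent of n: the FPU waits on memory.
print(dot_intensity(10**6))  # -> 0.125
```

No hardware can beat memory bandwidth here, which is why neither a
faster CPU nor a GPU helps much for dot products on large vectors.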
For the GPU to be faster than the CPU, you need a situation where
computation dominates memory access. Matrix-matrix multiplication is one
such example: it performs O(n^3) flops on only O(n^2) data. This is what
GPUs are designed to do, as it is the major bottleneck in 3D graphics.
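The contrast with the dot product is easy to see in the same kind of
sketch (again my own illustration, hypothetical helper name): for an
n x n matrix product, the flops-per-byte ratio grows linearly with n,
so for large matrices the arithmetic units, not memory, set the pace.

```python
def matmul_intensity(n, itemsize=8):
    """Flops per byte for multiplying two dense n x n float64 matrices."""
    flops = 2 * n**3                    # n^2 output dot products of length n
    bytes_moved = 3 * n**2 * itemsize   # read A and B, write C
    return flops / bytes_moved

# Intensity grows with n: computation dominates memory traffic,
# which is exactly the regime GPUs are built for.
print(matmul_intensity(12))    # -> 1.0
print(matmul_intensity(1200))  # -> 100.0
```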
The proper way to speed up "dot(a*b+c*sqrt(d), e)" is to get rid of
temporary intermediates. That is, in Python pseudo-code:
result = 0
for i in range(n):
    result += (a[i]*b[i] + c[i]*sqrt(d[i])) * e[i]
instead of:
tmp0 = empty(n)
for i in range(n):
    tmp0[i] = a[i] * b[i]
tmp1 = empty(n)
for i in range(n):
    tmp1[i] = sqrt(d[i])
tmp2 = empty(n)
for i in range(n):
    tmp2[i] = c[i] * tmp1[i]
tmp3 = empty(n)
for i in range(n):
    tmp3[i] = tmp0[i] + tmp2[i]
result = 0
for i in range(n):
    result += tmp3[i] * e[i]
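In real NumPy you can get part of the way there today by writing ufunc
results into preallocated buffers with the out= argument. A minimal
sketch (variable names are mine; tools such as numexpr automate this
kind of loop fusion):

```python
import numpy as np

n = 1000
rng = np.random.default_rng(0)
a, b, c, d, e = (rng.random(n) for _ in range(5))

# Straightforward NumPy: each operator allocates a fresh temporary
# array, mirroring the tmp0..tmp3 loops above.
naive = np.dot(a * b + c * np.sqrt(d), e)

# Reusing two preallocated buffers removes the hidden temporaries.
tmp = np.empty(n)
scratch = np.empty(n)
np.sqrt(d, out=tmp)               # tmp = sqrt(d)
np.multiply(c, tmp, out=tmp)      # tmp = c * sqrt(d)
np.multiply(a, b, out=scratch)    # scratch = a * b
np.add(scratch, tmp, out=tmp)     # tmp = a*b + c*sqrt(d)
fused = np.dot(tmp, e)

assert np.allclose(naive, fused)
```

This does the same flops as the one-liner but makes fewer passes over
memory and no per-operator allocations.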
It is these temporary arrays, and the extra passes over memory they
imply, that make NumPy an order of magnitude slower than hand-crafted C
(but still much faster than pure Python!). Adding GPUs will not change
this: the amount of computation (flop count) is the same, so it cannot
be the source of the slowness.
Sturla Molden