[Numpy-discussion] Fast threading solution thoughts
Michael Abshoff
michael.abshoff@googlemail....
Thu Feb 12 09:54:07 CST 2009
Nathan Bell wrote:
> On Thu, Feb 12, 2009 at 8:19 AM, Michael Abshoff
> <michael.abshoff@googlemail.com> wrote:
Hi,
>> Not even close. The current generation peaks at around 1.2 TFlops single
>> precision, 280 GFlops double precision for ATI's hardware. The main
>> problem with those numbers is that the memory on the graphics card
>> cannot feed the data fast enough into the GPU to achieve theoretical
>> peak. So those hundreds of GFlops are pure marketing :)
>>
>
> If your application is memory bandwidth limited, then yes you're not
> likely to see 100s of GFlops anytime soon. However, compute limited
> applications can and do achieve 100s of GFlops on GPUs. Basic
> operations like FFTs and (level 3) BLAS are compute limited, as are
> the following applications:
> http://www.ks.uiuc.edu/Research/gpu/
> http://www.dam.brown.edu/scicomp/scg-media/report_files/BrownSC-2008-27.pdf
Yes, certainly. But Sturla implied that some "random consumer GPU" (to
put a negative spin on it :) could do the above. There also seems to be
a huge expectation that "porting your code to the GPU" will make it 10
to 100 times faster. There are cases like that as mentioned above, but
this only applies to a subset of problems.
Another problem is RAM: for many datasets I work with, 512 to 1024 MB
just plainly isn't cutting it. This means Tesla cards at $1k and upward,
and all of a sudden we are playing a different game.
Nine months ago, when we first started playing with CUDA, we took
a MacBook Pro with a decent NVidia card and laughed hard after it became
clear that its Core2 with either ATLAS or the Accelerate framework (which
is more or less ATLAS for its BLAS bits) was faster than the built-in
NVidia card in either single or double precision. Sure, this is a
consumer level laptop GPU, but I did expect more.
>> So in reality you might get anywhere from 20% to 60% (if you are lucky)
>> locally before accounting for transfers from main memory to GPU memory
>> and so on. Given that recent Intel CPUs give you about 7 to 11 GFlops
>> Double per core and libraries like ATLAS give you that performance today
>> without the need to jump through hoops these numbers start to look a lot
>> less impressive.
>
> You neglect to mention that CPUs, which have roughly 1/10th the memory
> bandwidth of high-end GPUs, are memory bound on the very same
> problems. You will not see 7 to 11 GFLops on a memory bound CPU code
> for the same reason you argue that GPUs don't achieve 100s of GFLops
> on memory bound GPU codes.
I am seeing 7 to 11 GFlops per core for matrix-matrix multiplies on Intel
CPUs using Strassen. And we did scale out linearly on 16-core Opterons
as well as a 64-core Itanium box using ATLAS for BLAS level 3
matrix-matrix multiply. When you have multiple GPUs you
do not have shared memory architectures (AFAIK the 4 GPU boxen sold by
NVidia have fast buses between the cards, but aren't ccNUMA or anything
like that - please correct me if I am wrong).
> In severely memory bound applications like sparse matrix-vector
> multiplication (i.e. A*x for sparse A) the best GPU performance you
> can expect is ~10 GFLops on the GPU and ~1 GFLop on the CPU (in double
> precision). We discuss this problem in the following tech report:
> http://forums.nvidia.com/index.php?showtopic=83825
Ok, I care about dense operations primarily, but it is interesting to
see that the GPU fares well on sparse LA.
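As a rough illustration of the compute-bound versus memory-bound distinction debated above, here is a minimal roofline-style sketch in Python. It is not taken from either poster. The function names, the per-device peak and bandwidth figures, and the assumed 20 bytes moved per sparse nonzero are all illustrative assumptions chosen to be in the same ballpark as the numbers quoted in this thread, not measurements.

```python
# Roofline-style estimate: attainable rate is capped either by peak compute
# or by memory bandwidth times arithmetic intensity (flops per byte moved).
# All hardware figures below are illustrative assumptions, not measurements.

def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Upper bound on sustained GFlops for a kernel with the given intensity."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)

def gemm_intensity(n, bytes_per_element=8):
    """Flops per byte for an n x n matrix multiply, assuming each of the
    three matrices crosses the memory bus exactly once (an idealisation)."""
    return (2 * n**3) / (3 * n**2 * bytes_per_element)

# Assumed sparse matrix-vector intensity in double precision: 2 flops per
# nonzero against roughly 20 bytes (8-byte value, 4-byte index, 8-byte x entry).
SPMV_INTENSITY = 2 / 20

# Hypothetical devices: (peak GFlops double precision, memory bandwidth GB/s).
GPU = (280.0, 100.0)
CPU_CORE = (10.0, 10.0)

for name, (peak, bw) in (("GPU", GPU), ("CPU core", CPU_CORE)):
    dense = attainable_gflops(peak, bw, gemm_intensity(4096))
    sparse = attainable_gflops(peak, bw, SPMV_INTENSITY)
    print(f"{name}: dense bound {dense:.1f} GFlops, sparse bound {sparse:.1f} GFlops")
```

Under this model a large dense multiply is limited by peak compute on both devices, while the sparse product is limited by bandwidth on both, so the GPU's advantage there tracks its bandwidth advantage rather than its peak flops. Any agreement with figures quoted in the thread only reflects the assumed inputs.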
> It's true that host<->device transfers can be a bottleneck. In many
> cases, the solution is to simply leave the data resident on the GPU.
Well, that assumes you have enough memory locally for your working set.
And if not, you need to be clever about caching, and I did not see any
code in CUDA that takes care of that job for you. I have seen libraries
like libflame that claim to do that for you, but I have not played with
them yet.
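To put rough numbers on the working-set and transfer concerns above, here is a back-of-the-envelope sketch in Python for dense n x n multiplies in double precision. The helper names, the 100 GFlops kernel rate and the 5 GB/s bus rate are hypothetical values picked for illustration, not measurements of any particular card.

```python
# Back-of-the-envelope model of host<->device transfer cost for dense n x n
# matrix multiplies in double precision. Every rate here is an assumption.

BYTES_PER_DOUBLE = 8

def working_set_bytes(n):
    """Bytes needed to hold A, B and C for C = A * B."""
    return 3 * n * n * BYTES_PER_DOUBLE

def fits_on_device(n, device_mem_mb):
    """True if all three matrices fit in device memory at once."""
    return working_set_bytes(n) <= device_mem_mb * 1024 * 1024

def effective_gflops(n, kernel_gflops, bus_gb_s, multiplies_per_transfer=1):
    """Sustained rate when the three matrices cross the bus once in total and
    are reused for the given number of multiplies while resident on the device."""
    flops = 2 * n**3 * multiplies_per_transfer
    compute_s = flops / (kernel_gflops * 1e9)
    transfer_s = working_set_bytes(n) / (bus_gb_s * 1e9)
    return flops / (compute_s + transfer_s) / 1e9

for n in (512, 4096, 8192):
    print(n, "fits in 1024 MB:", fits_on_device(n, 1024),
          "one-shot: %.1f" % effective_gflops(n, 100.0, 5.0),
          "resident x10: %.1f" % effective_gflops(n, 100.0, 5.0, 10))
```

In this model the transfer penalty shrinks as n grows or as more work is done per transfer, which is the argument for keeping data resident. The catch is the one raised above: once the working set exceeds device memory, the simple model no longer applies and some blocking or caching scheme is needed.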
> For instance, you could imagine a variant of ndarray that held a
> pointer to a device array. Of course this requires that the other
> expensive parts of your algorithm also execute on the GPU so you're
> not shuttling data over the PCIe bus all the time.
Absolutely. I think that GPUs can fill a large niche for scientific
computations, but they are not (yet?) the general purpose CPUs they are
sometimes made out to be.
>
> Full Disclosure: I'm a researcher at NVIDIA
Cool. Thanks for the links by the way.
As I mentioned, we have bought Tesla hardware and are working on getting
our code to use GPUs for numerical linear algebra, exact linear algebra
and shortly also things like Monte Carlo simulation. I do think that the
GPU is extremely useful for much of the above, but there are plenty of
programming issues to resolve and a lot of infrastructure code to be
written before GPU computing becomes ubiquitous. After the last new
thing I had put my hope in (the Cell CPU) basically turned out to be a
dud, I am hesitant about anything until the code I am running actually
sees the benefit.
The thing with NVidia I am unhappy about is that CUDA is not free as in
freedom. I am not an FSF zealot, so I will not try to convince anyone to
make their software free. Given a choice between OpenCL and CUDA you
have the lead at the moment because you actually have been shipping a
working product for more than a year, but I am not so sure that in the
long term OpenCL won't get people's mindshare. If you look at the
history of 3D acceleration we started with numerous APIs that were all
supplanted by OpenGL, which then got pushed aside by DirectX. Anyway, no
point in ranting here any more ;)
Cheers,
Michael