[Numpy-discussion] Fast threading solution thoughts

Michael Abshoff michael.abshoff@googlemail....
Thu Feb 12 09:54:07 CST 2009

Nathan Bell wrote:
> On Thu, Feb 12, 2009 at 8:19 AM, Michael Abshoff
> <michael.abshoff@googlemail.com> wrote:


>> Not even close. The current generation peaks at around 1.2 TFlops single
>> precision, 280 GFlops double precision for ATI's hardware. The main
>> problem with those numbers is that the memory on the graphics card
>> cannot feed the data fast enough into the GPU to achieve theoretical
>> peak. So those hundreds of GFlops are pure marketing :)
> If your application is memory bandwidth limited, then yes you're not
> likely to see 100s of GFlops anytime soon.  However, compute-limited
> applications can and do achieve 100s of GFlops on GPUs.  Basic
> operations like FFTs and (level 3) BLAS are compute limited, as are
> the following applications:
> http://www.ks.uiuc.edu/Research/gpu/
> http://www.dam.brown.edu/scicomp/scg-media/report_files/BrownSC-2008-27.pdf

Yes, certainly. But Sturla implied that some "random consumer GPU" (to 
put a negative spin on it :) could do the above. There also seems to be 
a huge expectation that "porting your code to the GPU" will make it 10 
to 100 times faster. There are cases like that, as mentioned above, but 
they apply only to a subset of problems.

Another problem is RAM: for many of the datasets I work with, 512 to 
1024 MB just isn't cutting it. That means Tesla cards at $1k and upward, 
and all of a sudden we are playing a different game.

Nine months ago, when we first started playing with CUDA, we took a 
MacBook Pro with a decent NVidia card and laughed hard once it became 
clear that its Core2, with either ATLAS or the Accelerate Framework 
(which is more or less ATLAS for its BLAS bits), was faster than the 
built-in NVidia card in either single or double precision. Sure, this is 
a consumer-level laptop GPU, but I did expect more.

>> So in reality you might get anywhere from 20% to 60% (if you are lucky)
>> locally before accounting for transfers from main memory to GPU memory
>> and so on. Given that recent Intel CPUs give you about 7 to 11 Glops
>> Double per core and libraries like ATLAS give you that performance today
>> without the need to jump through hoops these number start to look a lot
>> less impressive.
> You neglect to mention that CPUs, which have roughly 1/10th the memory
> bandwidth of high-end GPUs, are memory bound on the very same
> problems.  You will not see 7 to 11 GFLops on a memory bound CPU code
> for the same reason you argue that GPUs don't achieve 100s of GFLops
> on memory bound GPU codes.

I am seeing 7 to 11 GFLOPS per core for matrix-matrix multiplies on 
Intel CPUs using Strassen. And we scaled linearly on 16-core Opterons as 
well as a 64-core Itanium box using ATLAS for BLAS level 3 matrix-matrix 
multiply. When you have multiple GPUs you do not have a shared memory 
architecture (AFAIK the 4-GPU boxen sold by NVidia have fast buses 
between the cards, but aren't ccNUMA or anything like that - please 
correct me if I am wrong).
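For what it's worth, here is roughly how I check such per-core numbers myself. This is just a sketch in Python/NumPy (the function name is made up); it times whatever BLAS NumPy was built against and reports the effective rate, counting the classical 2*n^3 flops - a Strassen-based BLAS does fewer actual operations, so its "effective" GFLOPS can exceed the classical peak:

```python
import time
import numpy as np

def measure_gemm_gflops(n=1000, repeats=3):
    """Estimate the effective GFLOPS of the dgemm behind numpy.dot.

    Counts the classical 2*n**3 flops of an n x n matrix multiply
    and divides by the best wall-clock time over a few repeats.
    """
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        np.dot(a, b)
        best = min(best, time.perf_counter() - t0)
    return 2.0 * n**3 / best / 1e9

print("effective GFLOPS: %.1f" % measure_gemm_gflops())
```

Run it single-threaded (e.g. with your BLAS pinned to one thread) to get a per-core figure.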

> In severely memory bound applications like sparse matrix-vector
> multiplication (i.e. A*x for sparse A) the best GPU performance you
> can expect is ~10 GFLops on the GPU and ~1 GFLop on the CPU (in double
> precision).  We discuss this problem in the following tech report:
> http://forums.nvidia.com/index.php?showtopic=83825

Ok, I care about dense operations primarily, but it is interesting to 
see that the GPU fares well on sparse LA.
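The gap Nathan quotes falls out of a simple bandwidth argument, which you can sketch in a few lines (the numbers and the function name are my own illustration, not from his report): double-precision CSR SpMV does 2 flops per nonzero while streaming at least 12 bytes per nonzero (an 8-byte value plus a 4-byte column index), i.e. roughly 6 bytes per flop, so memory bandwidth alone caps the throughput:

```python
def spmv_gflops_bound(bandwidth_gbs, bytes_per_flop=6.0):
    """Upper bound on CSR SpMV throughput implied by memory bandwidth.

    Assumes 2 flops and ~12 streamed bytes per nonzero (8-byte value
    + 4-byte column index), ignoring traffic for x, y and row pointers,
    so the bound is optimistic.
    """
    return bandwidth_gbs / bytes_per_flop

# Ballpark memory systems from the thread's era:
print(spmv_gflops_bound(10.0))    # ~10 GB/s CPU -> a couple of GFLOPS at best
print(spmv_gflops_bound(120.0))   # ~120 GB/s GPU -> an order of magnitude more
```

That lines up with the ~1 GFLOP CPU vs ~10 GFLOPS GPU figures above: it is the bandwidth ratio, not the flop ratio, that shows through.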

> It's true that host<->device transfers can be a bottleneck.  In many
> cases, the solution is to simply leave the data resident on the GPU.

Well, that assumes you have enough memory locally for your working set. 
If not, you need to be clever about caching, and I did not see any code 
in CUDA that takes care of that job for you. I have seen libraries like 
libflame that claim to do it, but I have not played with them yet.
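To make concrete what that manual caching looks like, here is a plain NumPy sketch of a blocked matrix multiply that only ever touches three tile-sized panels at a time (the function name is made up). In a real CUDA version each panel would be copied to the device, multiplied there, and the partial result accumulated - exactly the bookkeeping the toolkit leaves to you:

```python
import numpy as np

def tiled_matmul(a, b, tile=256):
    """Blocked matrix multiply over tile x tile panels.

    Stands in for out-of-core GEMM: at any moment only one panel of A,
    one of B and one of C are "hot", so the working set fits a device
    (or cache) much smaller than the full matrices.
    """
    n, k = a.shape
    k2, m = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((n, m))
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                a_blk = a[i:i+tile, p:p+tile]   # would be host->device copy
                b_blk = b[p:p+tile, j:j+tile]   # would be host->device copy
                c[i:i+tile, j:j+tile] += np.dot(a_blk, b_blk)
    return c
```

The tile size is the knob: big enough to amortize each transfer against the O(tile^3) work done per panel, small enough that three panels fit on the card.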

> For instance, you could imagine a variant of ndarray that held a
> pointer to a device array.  Of course this requires that the other
> expensive parts of your algorithm also execute on the GPU so you're
> not shuttling data over the PCIe bus all the time.

Absolutely. I think that GPUs can fill a large niche for scientific 
computations, but it is not (yet?) the general purpose CPU it is 
sometimes made out to be.
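The device-resident ndarray idea can be sketched in a few lines of Python. Everything here is a toy mock-up for illustration (the class and its methods are invented; with PyCUDA the real analogue would be pycuda.gpuarray.GPUArray): the point is simply that whole expressions stay "on the device" and only an explicit get() pays the PCIe transfer:

```python
import numpy as np

class DeviceArray:
    """Toy mock of an ndarray variant holding a device pointer.

    The NumPy array plays the role of device memory: one "upload" at
    construction, arithmetic stays device-side, and only get() does a
    "download". No actual GPU is involved.
    """
    def __init__(self, host_data):
        self._dev = np.array(host_data, dtype=float, copy=True)  # host->device

    def __add__(self, other):
        return DeviceArray(self._dev + other._dev)   # runs "on the device"

    def __mul__(self, other):
        return DeviceArray(self._dev * other._dev)   # runs "on the device"

    def get(self):
        return self._dev.copy()                      # the only device->host copy

x = DeviceArray([1.0, 2.0, 3.0])
y = DeviceArray([4.0, 5.0, 6.0])
z = (x + y) * y          # the whole expression stays device-resident
print(z.get())           # single transfer back: [20. 35. 54.]
```

The win is that a chain of N operations costs two transfers instead of 2N, which is the whole argument for keeping the data resident.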

> Full Disclosure: I'm a researcher at NVIDIA

Cool. Thanks for the links by the way.

As I mentioned, we have bought Tesla hardware and are working on getting 
our code to use GPUs for numerical linear algebra, exact linear algebra 
and shortly also things like Monte Carlo simulation. I do think that the 
GPU is extremely useful for much of the above, but there are plenty of 
programming issues to resolve and a lot of infrastructure code to be 
written before GPU computing becomes ubiquitous. After the last new 
thing I had put my hope in (the Cell CPU) basically turned out to be a 
dud, I am hesitant about anything until the code I am running actually 
sees the benefit.

The thing I am unhappy about with NVidia is that CUDA is not free as in 
freedom. I am not an FSF zealot, so I will not try to convince anyone to 
make their software free. Given a choice between OpenCL and CUDA, you 
have the lead at the moment because you actually have been shipping a 
working product for more than a year, but I am not so sure that in the 
long term OpenCL won't win people's mindshare. If you look at the 
history of 3D acceleration, we started with numerous APIs that were all 
supplanted by OpenGL, which then got pushed aside by DirectX.  Anyway, no 
point in ranting here any more ;)


