[Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?
Wes McKinney
wesmckinn@gmail....
Tue Feb 15 10:48:05 CST 2011
On Tue, Feb 15, 2011 at 11:33 AM, Sebastian Haase <seb.haase@gmail.com> wrote:
> Wes,
> I think I should have a couple of GPUs. I would be ready for anything
> ... if you think that I could do some easy(!) CUDA programming here,
> maybe you could guide me into the right direction...
> Thanks,
> Sebastian.
>
>
> On Tue, Feb 15, 2011 at 5:26 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
>> On Tue, Feb 15, 2011 at 11:25 AM, Matthieu Brucher
>> <matthieu.brucher@gmail.com> wrote:
>>> Use directly restrict in C99 mode (__restrict does not have exactly the same
>>> semantics).
>>> For a valgrind profil, you can check my blog
>>> (http://matt.eifelle.com/2009/04/07/profiling-with-valgrind/)
>>> Basically, if you have a python script, you can valgrind --optionsinmyblog
>>> python myscript.py
>>> For PAPI, you have to install several packages (perf module for kernel for
>>> instance) and a GUI to analyze the results (in Eclispe, it should be
>>> possible).
>>> Matthieu
>>> 2011/2/15 Sebastian Haase <seb.haase@gmail.com>
>>>>
>>>> Thanks Matthieu,
>>>> using __restrict__ with g++ did not change anything. How do I use
>>>> valgrind with C extensions?
>>>> I don't know what "PAPI profil" is ...?
>>>> -Sebastian
>>>>
>>>>
>>>> On Tue, Feb 15, 2011 at 4:54 PM, Matthieu Brucher
>>>> <matthieu.brucher@gmail.com> wrote:
>>>> > Hi,
>>>> > My first move would be to add a restrict keyword to dist (i.e. dist is
>>>> > the
>>>> > only pointer to the specific memory location), and then declare dist_
>>>> > inside
>>>> > the first loop also with a restrict.
>>>> > Then, I would run valgrind or a PAPI profil on your code to see what
>>>> > causes
>>>> > the issue (false sharing, ...)
>>>> > Matthieu
>>>> >
>>>> > 2011/2/15 Sebastian Haase <seb.haase@gmail.com>
>>>> >>
>>>> >> Hi,
>>>> >> I assume that someone here could maybe help me, and I'm hoping it's
>>>> >> not too much off topic.
>>>> >> I have 2 arrays of 2d point coordinates and would like to calculate
>>>> >> all pairwise distances as fast as possible.
>>>> >> Going from Python/Numpy to a (Swigged) C extension already gave me a
>>>> >> 55x speedup.
>>>> >> (.9ms vs. 50ms for arrays of length 329 and 340).
>>>> >> I'm using gcc on Linux.
>>>> >> Now I'm wondering if I could go even faster !?
>>>> >> My hope that the compiler might automagically do some SSE2
>>>> >> optimization got disappointed.
>>>> >> Since I have a 4 core CPU I thought OpenMP might be an option;
>>>> >> I never used that, and after some playing around I managed to get
>>>> >> (only) 50% slowdown(!) :-(
>>>> >>
>>>> >> My code in short is this:
>>>> >> (My SWIG typemaps use obj_to_array_no_conversion() from numpy.i)
>>>> >> -------<Ccode> ----------
>>>> >> void dists2d(
>>>> >> double *a_ps, int nx1, int na,
>>>> >> double *b_ps, int nx2, int nb,
>>>> >> double *dist, int nx3, int ny3) throw (char*)
>>>> >> {
>>>> >> if(nx1 != 2) throw (char*) "a must be of shape (n,2)";
>>>> >> if(nx2 != 2) throw (char*) "b must be of shape (n,2)";
>>>> >> if(nx3 != nb || ny3 != na) throw (char*) "c must be of shape
>>>> >> (na,nb)";
>>>> >>
>>>> >> double *dist_;
>>>> >> int i, j;
>>>> >>
>>>> >> #pragma omp parallel private(dist_, j, i)
>>>> >> {
>>>> >> #pragma omp for nowait
>>>> >> for(i=0;i<na;i++)
>>>> >> {
>>>> >> //num_threads=omp_get_num_threads(); --> 4
>>>> >> dist_ = dist+i*nb; // dists_ is only
>>>> >> introduced for OpenMP
>>>> >> for(j=0;j<nb;j++)
>>>> >> {
>>>> >> *dist_++ = sqrt( sq(a_ps[i*nx1] -
>>>> >> b_ps[j*nx2]) +
>>>> >>
>>>> >> sq(a_ps[i*nx1+1]
>>>> >> - b_ps[j*nx2+1]) );
>>>> >> }
>>>> >> }
>>>> >> }
>>>> >> }
>>>> >> -------</Ccode> ----------
>>>> >> There is probably a simple mistake in this code - as I said I never
>>>> >> used OpenMP before.
>>>> >> It should be not too difficult to use OpenMP correctly here
>>>> >> or - maybe better -
>>>> >> is there a simple SSE(2,3,4) version that might be even better than
>>>> >> OpenMP... !?
>>>> >>
>>>> >> I supposed, that I did not get the #pragma omp lines right - any idea ?
>>>> >> Or is it in general not possible to speed this kind of code up using
>>>> >> OpenMP !?
>>>> >>
>>>> >> Thanks,
>>>> >> Sebastian Haase
>>>> >> _______________________________________________
>>>> >> NumPy-Discussion mailing list
>>>> >> NumPy-Discussion@scipy.org
>>>> >> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Information System Engineer, Ph.D.
>>>> > Blog: http://matt.eifelle.com
>>>> > LinkedIn: http://www.linkedin.com/in/matthieubrucher
>>>> >
>>>> > _______________________________________________
>>>> > NumPy-Discussion mailing list
>>>> > NumPy-Discussion@scipy.org
>>>> > http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>> >
>>>> >
>>>> _______________________________________________
>>>> NumPy-Discussion mailing list
>>>> NumPy-Discussion@scipy.org
>>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>
>>>
>>>
>>> --
>>> Information System Engineer, Ph.D.
>>> Blog: http://matt.eifelle.com
>>> LinkedIn: http://www.linkedin.com/in/matthieubrucher
>>>
>>> _______________________________________________
>>> NumPy-Discussion mailing list
>>> NumPy-Discussion@scipy.org
>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>
>>>
>>
>> As an aside, do you have a GPU-equipped machine? This function looks
>> like it would be an easy CUDA target. Depends on who the users of the
>> function are (e.g. do they also have the CUDA runtime) if whether you
>> wanted to go that route.
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion@scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
These days it's pretty low-hanging fruit-- but you need to learn a bit
about the CUDA API and programming model through the user guide:
http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf
I would say it's time well-invested.
There is PyCUDA which makes the "boilerplate" parts of doing GPU
computing a bit more pleasant (just write the kernel)-- I've been
using Cython recently to interact with CUDA-based C code which is not
a terrible option either (more work than PyCUDA).
But if speed is your concern than the GPU will be *way* more bang for
your buck than OpenMP/SSE2. I'd be surprised if you can't get at least
20x speedup over single-threaded C code even without tweaking (and
maybe even a lot faster than that). But unfortunately with CUDA
programming you spend a lot of your time tweaking memory transactions
to get the best performance.
More information about the NumPy-Discussion
mailing list