[Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

Sebastian Haase seb.haase@gmail....
Wed Feb 16 09:02:16 CST 2011


Update:
I just noticed that using Eric's OpenMP code gave me only a 1.35x
speedup when comparing 3 threads vs. my non OpenMP code. However, when
comparing 3 threads vs. 1 thread, I could call this a 2.55x speedup.
This obviously sounds much better, but is obviously not the number
that matters...
(Ergo Eric's 11x speedup on 12 cores might also be more like 5x or 6x
when compared to simple C. w/o any OpenMP overhead).

Finally, I have to report that when plugging the C code (from
scipy.spatial.distance.cdist) into my real application,
which compares for one sample file point pairs from a sequence of 1000
microscopy images,
 the speed gain appears to be a total of 2x (only !!) -- compared to
my original app that used the simple numpy version implemented like
this:

def pyDist2(a_ps,b_ps, ab_dist):
    ab_dist = N.zeros(shape=(a_ps.shape[0],b_ps.shape[0]), dtype= N.float64)
    for i in range(a_ps.shape[0]):
            d = b_ps- a_ps[i]
            ab_dist[i] = N.sqrt( N.sum(d**2, -1) )
    return ab_dist

(The test I was using before was just testing one pair of two arrays
of length 329 and 340, whereas the average length for the 1000 images
is 260
It could also be that I have substantial constant over from some
OpenGL lists that I'm generating in my app.)

Nevertheless, I'm happy having learned something about OpenMP.
I'm still trying to get more info from valgrind.
And, maybe someone could comment, if SSE would help for the function
in question.

Thanks,
Sebastian.



On Wed, Feb 16, 2011 at 1:50 PM, Sebastian Haase <seb.haase@gmail.com> wrote:
> Eric,
> this is amazing !! Thanks very much, I have rarely seen such a compact
> source example that just worked.
> The timings I get are:
> c_threads  1  time  0.00155731916428
> c_threads  2  time  0.000829789638519
> c_threads  3  time  0.000616888999939
> c_threads  4  time  0.000704760551453
> c_threads  5  time  0.000933599472046
> c_threads  6  time  0.000809240341187
> c_threads  7  time  0.000837240219116
> c_threads  8  time  0.000817658901215
> c_threads  9  time  0.000843930244446
> c_threads 10  time  0.000861320495605
> c_threads 11  time  0.000936930179596
> c_threads 12  time  0.000847370624542
>
> The optimum for my
> Intel(R) Core(TM)2 Quad CPU    Q9550  @ 2.83GHz
> seems to be at 3 threads with .6 ms for the given test case.
>
> I just reran my normal non-OpenMP C based code, and it takes
> .84 ms   (1.35x slower).
> from scipy.spatial import distance
> distance.cdist(a,b)   takes
> .9 ms  -- which includes the allocation of the output array, because
> there is no `out` option available.
>
>
> So I'm happy that OpenMP works,
> but apparently on my CPU the speed increase is not overwhelming (yet)...
>
> Thanks,
> -- Sebastian
>
>
> On Wed, Feb 16, 2011 at 4:50 AM, Eric Carlson <ecarlson@eng.ua.edu> wrote:
>> I don't have the slightest idea what I'm doing, but....
>>
>>
>>
>> ____
>> file name - the_lib.c
>> ___
>> #include <stdio.h>
>> #include <time.h>
>> #include <omp.h>
>> #include <math.h>
>>
>> void dists2d(      double *a_ps, int na,
>>                   double *b_ps, int nb,
>>                   double *dist, int num_threads)
>> {
>>
>>    int i, j;
>>    int dynamic=0;
>>    omp_set_dynamic(dynamic);
>>    omp_set_num_threads(num_threads);
>>    double ax,ay, dif_x, dif_y;
>>    int nx1=2;
>>    int nx2=2;
>>
>>
>> #pragma omp parallel for private(j, i,ax,ay, dif_x, dif_y)
>>        for(i=0;i<na;i++)
>>          {
>>                ax=a_ps[i*nx1];
>>                 ay=a_ps[i*nx1+1];
>>                for(j=0;j<nb;j++)
>>                  {     dif_x = ax - b_ps[j*nx2];
>>                         dif_y = ay - b_ps[j*nx2+1];
>>                         dist[2*i+j]  = sqrt(dif_x*dif_x+dif_y*dif_y);
>>                  }
>>          }
>> }
>>
>>
>> ________
>>
>>
>> COMPILE:
>> __________
>> gcc -c the_lib.c -fPIC -fopenmp -ffast-math
>> gcc -shared -o the_lib.so the_lib.o -lgomp -lm
>>
>>
>> ____
>>
>> the_python_prog.py
>> _____________
>>
>> from ctypes import *
>> my_lib=CDLL('the_lib.so') #or full path to lib
>> import numpy as np
>> import time
>>
>> na=329
>> nb=340
>> a=np.random.rand(na,2)
>> b=np.random.rand(nb,2)
>> c=np.zeros(na*nb)
>> trials=100
>> max_threads = 24
>> for k in range(1,max_threads):
>>     n_threads =c_int(k)
>>     na2=c_int(na)
>>     nb2=c_int(nb)
>>
>>     start = time.time()
>>     for k1 in range(trials):
>>         ret =
>> my_lib.dists2d(a.ctypes.data_as(c_void_p),na2,b.ctypes.data_as(c_void_p),nb2,c.ctypes.data_as(c_void_p),n_threads)
>>     print "c_threads",k, " time ", (time.time()-start)/trials
>>
>>
>>
>> ____
>> Results on my machine, dual xeon, 12 cores
>> na=329
>> nb=340
>> ____
>>
>> 100 trials each:
>> c_threads 1  time  0.00109949827194
>> c_threads 2  time  0.0005726313591
>> c_threads 3  time  0.000429179668427
>> c_threads 4  time  0.000349278450012
>> c_threads 5  time  0.000287139415741
>> c_threads 6  time  0.000252468585968
>> c_threads 7  time  0.000222821235657
>> c_threads 8  time  0.000206289291382
>> c_threads 9  time  0.000187981128693
>> c_threads 10  time  0.000172770023346
>> c_threads 11  time  0.000164999961853
>> c_threads 12  time  0.000157740116119
>>
>> ____
>> ____
>> Results on my machine, dual xeon, 12 cores
>> na=3290
>> nb=3400
>> ______
>> 100 trials each:
>> c_threads 1  time  0.10744508028
>> c_threads 2  time  0.0542239999771
>> c_threads 3  time  0.037127559185
>> c_threads 4  time  0.0280736112595
>> c_threads 5  time  0.0228648614883
>> c_threads 6  time  0.0194904088974
>> c_threads 7  time  0.0165715909004
>> c_threads 8  time  0.0145838689804
>> c_threads 9  time  0.0130002498627
>> c_threads 10  time  0.0116940999031
>> c_threads 11  time  0.0107557415962
>> c_threads 12  time  0.00990005016327 (speedup almost 11)
>>
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion@scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>


More information about the NumPy-Discussion mailing list