[Numpy-discussion] OT: performance in C extension; OpenMP, or SSE ?

Sebastian Haase seb.haase@gmail....
Wed Feb 16 06:50:44 CST 2011


Eric,
this is amazing! Thanks very much; I have rarely seen such a compact
source example that just worked.
The timings I get are:
c_threads  1  time  0.00155731916428
c_threads  2  time  0.000829789638519
c_threads  3  time  0.000616888999939
c_threads  4  time  0.000704760551453
c_threads  5  time  0.000933599472046
c_threads  6  time  0.000809240341187
c_threads  7  time  0.000837240219116
c_threads  8  time  0.000817658901215
c_threads  9  time  0.000843930244446
c_threads 10  time  0.000861320495605
c_threads 11  time  0.000936930179596
c_threads 12  time  0.000847370624542

The optimum for my
Intel(R) Core(TM)2 Quad CPU    Q9550  @ 2.83GHz
seems to be at 3 threads, with about 0.62 ms for the given test case.
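
To put numbers on that, here is a small sketch that turns the timings
listed above into speedup and parallel efficiency relative to the
1-thread run. The values are copied from the list; the variable names are
only for illustration. At 3 threads the speedup works out to roughly
2.5x, and beyond that the extra threads no longer help on this quad core.

```python
# Seconds per call, copied from the c_threads timings above.
timings = {
    1: 0.00155731916428,
    2: 0.000829789638519,
    3: 0.000616888999939,
    4: 0.000704760551453,
    5: 0.000933599472046,
    6: 0.000809240341187,
    7: 0.000837240219116,
    8: 0.000817658901215,
    9: 0.000843930244446,
    10: 0.000861320495605,
    11: 0.000936930179596,
    12: 0.000847370624542,
}

base = timings[1]                        # 1-thread reference time
best = min(timings, key=timings.get)     # thread count with the lowest time

for n in sorted(timings):
    speedup = base / timings[n]          # how much faster than 1 thread
    efficiency = speedup / n             # fraction of ideal linear scaling
    print("%2d threads: %.2fx speedup, %3.0f%% efficiency"
          % (n, speedup, 100.0 * efficiency))

print("fastest: %d threads" % best)
```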

I just reran my normal non-OpenMP C-based code, and it takes
0.84 ms (about 1.35x slower than the 3-thread OpenMP run).
For comparison,
    from scipy.spatial import distance
    distance.cdist(a, b)
takes 0.9 ms, which includes the allocation of the output array, because
there is no `out` option available.
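
For a comparison that keeps allocation out of the timed loop, one option
is to fill preallocated arrays with plain NumPy ufuncs, which do accept
`out`. This is only a sketch: the function name `dists2d_into` and the
extra `scratch` buffer are my own illustration, not part of SciPy or of
the C code, and I have not timed it against either.

```python
import numpy as np

def dists2d_into(a, b, out, scratch):
    """Fill out[i, j] with the distance between a[i] and b[j].

    a is (na, 2), b is (nb, 2); out and scratch are preallocated
    (na, nb) float arrays, so no result array is allocated here.
    """
    np.subtract(a[:, 0:1], b[:, 0][None, :], out=out)      # dx via broadcasting
    np.square(out, out=out)                                # dx**2
    np.subtract(a[:, 1:2], b[:, 1][None, :], out=scratch)  # dy via broadcasting
    np.square(scratch, out=scratch)                        # dy**2
    np.add(out, scratch, out=out)                          # dx**2 + dy**2
    np.sqrt(out, out=out)                                  # Euclidean distance
    return out

# Same problem size as in the test case from this thread.
na, nb = 329, 340
a = np.random.rand(na, 2)
b = np.random.rand(nb, 2)
out = np.empty((na, nb))       # allocated once, reused on every call
scratch = np.empty((na, nb))
dists2d_into(a, b, out, scratch)
```

The two buffers are allocated once, outside any timing loop, so repeated
calls only measure the arithmetic.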


So I'm happy that OpenMP works,
but apparently on my CPU the speed increase is not overwhelming (yet)...

Thanks,
-- Sebastian


On Wed, Feb 16, 2011 at 4:50 AM, Eric Carlson <ecarlson@eng.ua.edu> wrote:
> I don't have the slightest idea what I'm doing, but....
>
>
>
> ____
> file name - the_lib.c
> ___
> #include <stdio.h>
> #include <time.h>
> #include <omp.h>
> #include <math.h>
>
> void dists2d(      double *a_ps, int na,
>                   double *b_ps, int nb,
>                   double *dist, int num_threads)
> {
>
>    int i, j;
>    int dynamic=0;
>    omp_set_dynamic(dynamic);
>    omp_set_num_threads(num_threads);
>    double ax,ay, dif_x, dif_y;
>    int nx1=2;
>    int nx2=2;
>
>
> #pragma omp parallel for private(j, i,ax,ay, dif_x, dif_y)
>        for(i=0;i<na;i++)
>          {
>                ax=a_ps[i*nx1];
>                 ay=a_ps[i*nx1+1];
>                for(j=0;j<nb;j++)
>                  {     dif_x = ax - b_ps[j*nx2];
>                         dif_y = ay - b_ps[j*nx2+1];
>                         dist[i*nb+j]  = sqrt(dif_x*dif_x+dif_y*dif_y); /* row-major: row i, column j of the na x nb result */
>                  }
>          }
> }
>
>
> ________
>
>
> COMPILE:
> __________
> gcc -c the_lib.c -fPIC -fopenmp -ffast-math
> gcc -shared -o the_lib.so the_lib.o -lgomp -lm
>
>
> ____
>
> the_python_prog.py
> _____________
>
> from ctypes import *
> my_lib=CDLL('the_lib.so') #or full path to lib
> import numpy as np
> import time
>
> na=329
> nb=340
> a=np.random.rand(na,2)
> b=np.random.rand(nb,2)
> c=np.zeros(na*nb)
> trials=100
> max_threads = 24
> for k in range(1,max_threads):
>     n_threads =c_int(k)
>     na2=c_int(na)
>     nb2=c_int(nb)
>
>     start = time.time()
>     for k1 in range(trials):
>         ret = my_lib.dists2d(a.ctypes.data_as(c_void_p), na2,
>                              b.ctypes.data_as(c_void_p), nb2,
>                              c.ctypes.data_as(c_void_p), n_threads)
>     print "c_threads",k, " time ", (time.time()-start)/trials
>
>
>
> ____
> Results on my machine, dual xeon, 12 cores
> na=329
> nb=340
> ____
>
> 100 trials each:
> c_threads 1  time  0.00109949827194
> c_threads 2  time  0.0005726313591
> c_threads 3  time  0.000429179668427
> c_threads 4  time  0.000349278450012
> c_threads 5  time  0.000287139415741
> c_threads 6  time  0.000252468585968
> c_threads 7  time  0.000222821235657
> c_threads 8  time  0.000206289291382
> c_threads 9  time  0.000187981128693
> c_threads 10  time  0.000172770023346
> c_threads 11  time  0.000164999961853
> c_threads 12  time  0.000157740116119
>
> ____
> ____
> Results on my machine, dual xeon, 12 cores
> na=3290
> nb=3400
> ______
> 100 trials each:
> c_threads 1  time  0.10744508028
> c_threads 2  time  0.0542239999771
> c_threads 3  time  0.037127559185
> c_threads 4  time  0.0280736112595
> c_threads 5  time  0.0228648614883
> c_threads 6  time  0.0194904088974
> c_threads 7  time  0.0165715909004
> c_threads 8  time  0.0145838689804
> c_threads 9  time  0.0130002498627
> c_threads 10  time  0.0116940999031
> c_threads 11  time  0.0107557415962
> c_threads 12  time  0.00990005016327 (speedup almost 11)
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>

