[Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

David Cournapeau david@ar.media.kyoto-u.ac...
Sat Mar 22 23:59:39 CDT 2008


Charles R Harris wrote:
>
> It looks like memory access is the bottleneck, otherwise running 4 
> floats through in parallel should go a lot faster. I need to modify 
> the program a bit and see how it works for doubles.

I am not sure the benchmark is really meaningful: it does not use 
aligned buffers (16-byte alignment), and because of that it does not 
give a good idea of what can be expected from SSE. It does show why it 
is not so easy to get good performance, though, and why just throwing 
in a few optimized loops won't work. Using SSE/SSE2 on unaligned 
buffers is a waste of time: without 16-byte alignment you have to fall 
back on the unaligned intrinsics (_mm_loadu_ps instead of _mm_load_ps), 
and those are extremely slow, killing most of the speed increase you 
can expect from SSE.

Here is what I get with the above benchmark:

                 100   0.0002ms (100.0%)   0.0001ms ( 71.5%)   0.0001ms ( 85.0%)
                1000   0.0014ms (100.0%)   0.0010ms ( 70.6%)   0.0013ms ( 96.8%)
               10000   0.0162ms (100.0%)   0.0095ms ( 58.2%)   0.0128ms ( 78.7%)
              100000   0.4189ms (100.0%)   0.4135ms ( 98.7%)   0.4149ms ( 99.0%)
             1000000   5.9523ms (100.0%)   5.8933ms ( 99.0%)   5.8910ms ( 99.0%)
            10000000  58.9645ms (100.0%)  58.2620ms ( 98.8%)  58.7443ms ( 99.6%)

Basically, no help at all: this is on a P4, whose FPU is extremely slow 
unless it is driven through optimized SSE.

Now, if I use posix_memalign, switch the intrinsics to their aligned 
variants, and use an accurate cycle counter (cycle.h, provided by 
FFTW), the picture changes.

Compiled as is:

Testing methods...
All OK

        Problem size                      Simple                      Intrin                      Inline
                 100    4.16e+02 cycles (100.0%)    4.04e+02 cycles ( 97.1%)    4.92e+02 cycles (118.3%)
                1000    3.66e+03 cycles (100.0%)    3.11e+03 cycles ( 84.8%)    4.10e+03 cycles (111.9%)
               10000    3.47e+04 cycles (100.0%)    3.01e+04 cycles ( 86.7%)    4.06e+04 cycles (116.8%)
              100000    1.36e+06 cycles (100.0%)    1.34e+06 cycles ( 98.7%)    1.45e+06 cycles (106.7%)
             1000000    1.92e+07 cycles (100.0%)    1.87e+07 cycles ( 97.1%)    1.89e+07 cycles ( 98.2%)
            10000000    1.86e+08 cycles (100.0%)    1.80e+08 cycles ( 96.8%)    1.81e+08 cycles ( 97.4%)

Compiled with -DALIGNED, which uses the aligned-access intrinsics:

Testing methods...
All OK

        Problem size                      Simple                      Intrin                      Inline
                 100    4.16e+02 cycles (100.0%)    1.96e+02 cycles ( 47.1%)    4.92e+02 cycles (118.3%)
                1000    3.82e+03 cycles (100.0%)    1.56e+03 cycles ( 40.8%)    4.22e+03 cycles (110.4%)
               10000    3.46e+04 cycles (100.0%)    1.92e+04 cycles ( 55.5%)    4.13e+04 cycles (119.4%)
              100000    1.32e+06 cycles (100.0%)    1.12e+06 cycles ( 85.0%)    1.16e+06 cycles ( 87.8%)
             1000000    1.95e+07 cycles (100.0%)    1.92e+07 cycles ( 98.3%)    1.95e+07 cycles (100.2%)
            10000000    1.82e+08 cycles (100.0%)    1.79e+08 cycles ( 98.4%)    1.81e+08 cycles ( 99.3%)

That makes a drastic difference (I did not touch the inline code, 
because it is Sunday and I am lazy). If I run this on a saner CPU 
(Core 2 Duo MacBook) instead of my Pentium 4, I get better results: in 
particular, the SSE code is never slower, and I get roughly a 2x speed 
increase as long as the buffer fits in cache.

It looks like using prefetch also gives some improvement when near the 
edge of the cache size (my P4 has a 512 KB L2 cache):

Testing methods...
All OK

        Problem size                      Simple                      Intrin                      Inline
                 100    4.16e+02 cycles (100.0%)    2.52e+02 cycles ( 60.6%)    4.92e+02 cycles (118.3%)
                1000    3.55e+03 cycles (100.0%)    1.85e+03 cycles ( 52.2%)    4.21e+03 cycles (118.7%)
               10000    3.48e+04 cycles (100.0%)    1.76e+04 cycles ( 50.6%)    4.13e+04 cycles (118.9%)
              100000    1.11e+06 cycles (100.0%)    7.20e+05 cycles ( 64.8%)    1.12e+06 cycles (101.3%)
             1000000    1.91e+07 cycles (100.0%)    1.98e+07 cycles (103.4%)    1.91e+07 cycles (100.0%)
            10000000    1.83e+08 cycles (100.0%)    1.90e+08 cycles (103.9%)    1.82e+08 cycles ( 99.3%)

The code can be seen here:

http://www.ar.media.kyoto-u.ac.jp/members/david/archives/t2/vec_bench.c
http://www.ar.media.kyoto-u.ac.jp/members/david/archives/t2/Makefile
http://www.ar.media.kyoto-u.ac.jp/members/david/archives/t2/cycle.h

Another thing that I have not seen mentioned, but which may be worth 
pursuing, is using SSE inside element-wise operations: you can get 
extremely fast exp, sin, cos and co. using SSE. Those are much easier 
to include in numpy (but much more difficult to implement...). See for 
example:

http://www.pixelglow.com/macstl/

cheers,

David
