[Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
David Cournapeau
david@ar.media.kyoto-u.ac...
Sat Mar 22 23:59:39 CDT 2008
Charles R Harris wrote:
>
> It looks like memory access is the bottleneck, otherwise running 4
> floats through in parallel should go a lot faster. I need to modify
> the program a bit and see how it works for doubles.
I am not sure the benchmark is really meaningful: it does not use
aligned buffers (16-byte alignment), and because of that it does not
give a good idea of what can be expected from SSE. It does show why it
is not so easy to get good performance, and why just throwing in a few
optimized loops won't work, though. Using SSE/SSE2 on unaligned
buffers is a waste of time: without the alignment guarantee you have to
use the unaligned load intrinsics (_mm_loadu_ps instead of _mm_load_ps),
and those are extremely slow, basically killing most of the speed
increase you can expect from SSE.
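For concreteness, here is a minimal sketch of the two kinds of load
(this is not the benchmark code itself; the scaling loop and the
function names are just an example). _mm_loadu_ps works on any pointer,
while _mm_load_ps requires 16-byte alignment but is much cheaper on the P4:

#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* y[i] = a * x[i], unaligned loads/stores: always legal, slow on the P4 */
static void scale_unaligned(float *y, const float *x, float a, size_t n)
{
    __m128 va = _mm_set1_ps(a);
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        _mm_storeu_ps(y + i, _mm_mul_ps(vx, va));
    }
    for (; i < n; ++i)          /* scalar tail */
        y[i] = a * x[i];
}

/* Same loop, but x and y MUST be 16-byte aligned */
static void scale_aligned(float *y, const float *x, float a, size_t n)
{
    __m128 va = _mm_set1_ps(a);
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        __m128 vx = _mm_load_ps(x + i);
        _mm_store_ps(y + i, _mm_mul_ps(vx, va));
    }
    for (; i < n; ++i)          /* scalar tail */
        y[i] = a * x[i];
}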
Here is what I get with the above benchmark:
Problem size             Simple                   Intrin                   Inline
         100    0.0002ms (100.0%)       0.0001ms ( 71.5%)       0.0001ms ( 85.0%)
        1000    0.0014ms (100.0%)       0.0010ms ( 70.6%)       0.0013ms ( 96.8%)
       10000    0.0162ms (100.0%)       0.0095ms ( 58.2%)       0.0128ms ( 78.7%)
      100000    0.4189ms (100.0%)       0.4135ms ( 98.7%)       0.4149ms ( 99.0%)
     1000000    5.9523ms (100.0%)       5.8933ms ( 99.0%)       5.8910ms ( 99.0%)
    10000000   58.9645ms (100.0%)      58.2620ms ( 98.8%)      58.7443ms ( 99.6%)
Basically, no help at all: this is on a P4, whose FPU is extremely slow
unless it is fed optimized SSE code.
Now, here is what I get if I use posix_memalign, add a variant that uses
the aligned-access intrinsics, and use an accurate cycle counter
(cycle.h, provided by FFTW).
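Roughly, the allocation change amounts to something like this (a sketch,
not the benchmark code itself; the buffer size and the scale factor are
arbitrary):

#include <stdio.h>
#include <stdlib.h>     /* posix_memalign, free */
#include <xmmintrin.h>

int main(void)
{
    size_t n = 1000000;
    float *x, *y;
    size_t i;

    /* 16-byte aligned buffers, so _mm_load_ps/_mm_store_ps are legal */
    if (posix_memalign((void **)&x, 16, n * sizeof(float)) ||
        posix_memalign((void **)&y, 16, n * sizeof(float)))
        return 1;

    for (i = 0; i < n; ++i)
        x[i] = (float)i;

    /* y = 2 * x with aligned SSE loads/stores (n is a multiple of 4 here) */
    {
        __m128 va = _mm_set1_ps(2.0f);
        for (i = 0; i + 4 <= n; i += 4)
            _mm_store_ps(y + i, _mm_mul_ps(_mm_load_ps(x + i), va));
    }

    printf("%f\n", y[42]);      /* keep the compiler from dropping the loop */
    free(x);
    free(y);
    return 0;
}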
Compiled as is:
Testing methods...
All OK
Problem size                   Simple                      Intrin                      Inline
         100    4.16e+02 cycles (100.0%)    4.04e+02 cycles ( 97.1%)    4.92e+02 cycles (118.3%)
        1000    3.66e+03 cycles (100.0%)    3.11e+03 cycles ( 84.8%)    4.10e+03 cycles (111.9%)
       10000    3.47e+04 cycles (100.0%)    3.01e+04 cycles ( 86.7%)    4.06e+04 cycles (116.8%)
      100000    1.36e+06 cycles (100.0%)    1.34e+06 cycles ( 98.7%)    1.45e+06 cycles (106.7%)
     1000000    1.92e+07 cycles (100.0%)    1.87e+07 cycles ( 97.1%)    1.89e+07 cycles ( 98.2%)
    10000000    1.86e+08 cycles (100.0%)    1.80e+08 cycles ( 96.8%)    1.81e+08 cycles ( 97.4%)
Compiled with -DALIGNED, which uses the aligned-access intrinsics:
Testing methods...
All OK
Problem size                   Simple                      Intrin                      Inline
         100    4.16e+02 cycles (100.0%)    1.96e+02 cycles ( 47.1%)    4.92e+02 cycles (118.3%)
        1000    3.82e+03 cycles (100.0%)    1.56e+03 cycles ( 40.8%)    4.22e+03 cycles (110.4%)
       10000    3.46e+04 cycles (100.0%)    1.92e+04 cycles ( 55.5%)    4.13e+04 cycles (119.4%)
      100000    1.32e+06 cycles (100.0%)    1.12e+06 cycles ( 85.0%)    1.16e+06 cycles ( 87.8%)
     1000000    1.95e+07 cycles (100.0%)    1.92e+07 cycles ( 98.3%)    1.95e+07 cycles (100.2%)
    10000000    1.82e+08 cycles (100.0%)    1.79e+08 cycles ( 98.4%)    1.81e+08 cycles ( 99.3%)
This makes a drastic difference (I did not touch the inline code,
because it is Sunday and I am lazy). If I run this on a saner CPU
(Core 2 Duo MacBook) instead of my Pentium 4, I get better results: in
particular, the SSE code is never slower, and I get roughly a 2x speedup
as long as the buffer fits in cache.
It looks like using prefetch also gives some improvement when the
working set is around the cache size (my P4 has a 512 KB L2 cache):
Testing methods...
All OK
Problem size                   Simple                      Intrin                      Inline
         100    4.16e+02 cycles (100.0%)    2.52e+02 cycles ( 60.6%)    4.92e+02 cycles (118.3%)
        1000    3.55e+03 cycles (100.0%)    1.85e+03 cycles ( 52.2%)    4.21e+03 cycles (118.7%)
       10000    3.48e+04 cycles (100.0%)    1.76e+04 cycles ( 50.6%)    4.13e+04 cycles (118.9%)
      100000    1.11e+06 cycles (100.0%)    7.20e+05 cycles ( 64.8%)    1.12e+06 cycles (101.3%)
     1000000    1.91e+07 cycles (100.0%)    1.98e+07 cycles (103.4%)    1.91e+07 cycles (100.0%)
    10000000    1.83e+08 cycles (100.0%)    1.90e+08 cycles (103.9%)    1.82e+08 cycles ( 99.3%)
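The kind of prefetch hint involved looks roughly like this (again a
sketch, not the exact benchmark code; the prefetch distance of 64 floats
is an arbitrary example value, not a tuned one):

#include <stddef.h>
#include <xmmintrin.h>   /* _mm_prefetch and SSE intrinsics */

#define PREFETCH_AHEAD 64   /* example distance, in floats */

/* y[i] = a * x[i] with aligned SSE loads and a software prefetch a few
 * cache lines ahead of the current position. */
static void scale_prefetch(float *y, const float *x, float a, size_t n)
{
    __m128 va = _mm_set1_ps(a);
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        if (i + PREFETCH_AHEAD < n)
            _mm_prefetch((const char *)(x + i + PREFETCH_AHEAD), _MM_HINT_T0);
        _mm_store_ps(y + i, _mm_mul_ps(_mm_load_ps(x + i), va));
    }
    for (; i < n; ++i)   /* scalar tail */
        y[i] = a * x[i];
}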
The code can be seen here:
http://www.ar.media.kyoto-u.ac.jp/members/david/archives/t2/vec_bench.c
http://www.ar.media.kyoto-u.ac.jp/members/david/archives/t2/Makefile
http://www.ar.media.kyoto-u.ac.jp/members/david/archives/t2/cycle.h
Another thing that I have not seen mentioned but may be worth pursuing
is using SSE in element-wise operations: you can get extremely fast exp,
sin, cos and so on using SSE. Those are much easier to include in numpy
(but much more difficult to implement...). See for example:
http://www.pixelglow.com/macstl/
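To illustrate what that would look like, here is a rough sketch of an
element-wise loop that processes four floats at a time; fast_exp_ps is
only a stand-in here (it just calls expf four times so the sketch is
self-contained), whereas a real version, like the ones in macstl, would
be a genuinely vectorized approximation:

#include <math.h>
#include <stddef.h>
#include <xmmintrin.h>

/* Stand-in for a vectorized exp on 4 packed floats; a real implementation
 * would use range reduction plus a polynomial, all in SSE registers. */
static __m128 fast_exp_ps(__m128 x)
{
    float tmp[4];
    _mm_storeu_ps(tmp, x);
    tmp[0] = expf(tmp[0]); tmp[1] = expf(tmp[1]);
    tmp[2] = expf(tmp[2]); tmp[3] = expf(tmp[3]);
    return _mm_loadu_ps(tmp);
}

/* Element-wise exp over a contiguous, 16-byte aligned float array,
 * 4 elements per iteration, scalar fallback for the tail. */
static void vec_expf(float *out, const float *in, size_t n)
{
    size_t i;
    for (i = 0; i + 4 <= n; i += 4)
        _mm_store_ps(out + i, fast_exp_ps(_mm_load_ps(in + i)));
    for (; i < n; ++i)
        out[i] = expf(in[i]);
}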
cheers,
David