[Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
Scott Ransom
sransom@nrao....
Sun Mar 23 11:00:03 CDT 2008
Hi David et al,
Very interesting. I thought that the 64-bit gcc toolchains
automatically aligned memory on 16-byte (or 32-byte) boundaries, but
apparently not: running your code certainly made the intrinsics
version quite a bit faster. Another thing I noticed was that the
"simple" code was _much_ faster using gcc-4.3 with -O3 than with -O2.
I've seen this with some other code recently as well -- the automatic
loop unrolling really helps for this type of code.
You can see my benchmarks here (posted separately to avoid line-wrap
issues):
http://www.cv.nrao.edu/~sransom/vec_results.txt
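For reference, the kind of "simple" loop where -O3's automatic
unrolling and vectorization pay off is just a straight element-wise
pass over the arrays. This is a sketch of the pattern, not the actual
vec_bench code:

```c
#include <stddef.h>

/* z = x + y, element by element.  With gcc-4.3 at -O3 this loop gets
 * unrolled (and often vectorized); at -O2 it is compiled naively.
 * Compare:  gcc -O2 vec.c   vs   gcc -O3 vec.c  */
void vec_add(float *restrict z, const float *restrict x,
             const float *restrict y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        z[i] = x[i] + y[i];
}
```

The `restrict` qualifiers tell the compiler the arrays do not overlap,
which is what makes the aggressive unrolling legal.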
Scott
On Sun, Mar 23, 2008 at 01:59:39PM +0900, David Cournapeau wrote:
> Charles R Harris wrote:
> >
> > It looks like memory access is the bottleneck, otherwise running 4
> > floats through in parallel should go a lot faster. I need to modify
> > the program a bit and see how it works for doubles.
>
> I am not sure the benchmark is really meaningful: it does not use
> aligned buffers (16-byte alignment), and because of that it does not
> give a good idea of what can be expected from SSE. It does show why
> it is not so easy to get good performance, and why just throwing in a
> few optimized loops won't work, though. Using SSE/SSE2 on unaligned
> buffers is a waste of time: without 16-byte alignment you have to use
> the unaligned load intrinsics (_mm_loadu_ps instead of _mm_load_ps),
> and those are extremely slow, basically killing most of the speed
> increase you can expect from SSE.
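As a sketch of what "aligned buffers" means in practice: _mm_load_ps
requires a 16-byte-aligned pointer, which plain malloc does not
guarantee, hence posix_memalign. The function names below are
illustrative, not taken from vec_bench.c:

```c
#include <emmintrin.h>   /* SSE/SSE2 intrinsics */
#include <stdlib.h>

/* Allocate n floats on a 16-byte boundary (NULL on failure). */
float *alloc_aligned(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, 16, n * sizeof(float)) != 0)
        return NULL;
    return (float *)p;
}

/* z = x + y, four floats at a time.  All three pointers must be
 * 16-byte aligned and n a multiple of 4, or _mm_load_ps faults. */
void add_sse(float *z, const float *x, const float *y, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        __m128 a = _mm_load_ps(x + i);   /* aligned load */
        __m128 b = _mm_load_ps(y + i);
        _mm_store_ps(z + i, _mm_add_ps(a, b));
    }
}
```

With malloc'd buffers the loads would have to be _mm_loadu_ps, which
is exactly the slow path David is describing.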
>
> Here what I get with the above benchmark:
>
> Size        Simple               Intrin               Inline
> 100         0.0002ms (100.0%)    0.0001ms ( 71.5%)    0.0001ms ( 85.0%)
> 1000        0.0014ms (100.0%)    0.0010ms ( 70.6%)    0.0013ms ( 96.8%)
> 10000       0.0162ms (100.0%)    0.0095ms ( 58.2%)    0.0128ms ( 78.7%)
> 100000      0.4189ms (100.0%)    0.4135ms ( 98.7%)    0.4149ms ( 99.0%)
> 1000000     5.9523ms (100.0%)    5.8933ms ( 99.0%)    5.8910ms ( 99.0%)
> 10000000    58.9645ms (100.0%)   58.2620ms ( 98.8%)   58.7443ms ( 99.6%)
>
> Basically, no help at all: this is on a P4, whose FPU is extremely
> slow unless it is fed optimized SSE code.
>
> Now, if I use posix_memalign, replace the intrinsics with their
> aligned-access versions, and use an accurate cycle counter (cycle.h,
> provided by FFTW), I get the results below.
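For the curious, a minimal sketch of what an x86 cycle counter like
FFTW's cycle.h boils down to: reading the CPU's time-stamp counter
with the rdtsc instruction. This x86-only version is illustrative;
the real cycle.h covers many more platforms and compilers:

```c
#include <stdint.h>

/* Read the x86 time-stamp counter: rdtsc returns the low 32 bits in
 * eax and the high 32 bits in edx.  x86/x86_64 with gcc only. */
static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}
```

Usage is the usual pattern: take a reading before and after the loop
under test and subtract, giving elapsed cycles rather than wall time.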
>
> Compiled as is:
>
> Testing methods...
> All OK
>
> Problem size   Simple                      Intrin                      Inline
> 100            4.16e+02 cycles (100.0%)    4.04e+02 cycles ( 97.1%)    4.92e+02 cycles (118.3%)
> 1000           3.66e+03 cycles (100.0%)    3.11e+03 cycles ( 84.8%)    4.10e+03 cycles (111.9%)
> 10000          3.47e+04 cycles (100.0%)    3.01e+04 cycles ( 86.7%)    4.06e+04 cycles (116.8%)
> 100000         1.36e+06 cycles (100.0%)    1.34e+06 cycles ( 98.7%)    1.45e+06 cycles (106.7%)
> 1000000        1.92e+07 cycles (100.0%)    1.87e+07 cycles ( 97.1%)    1.89e+07 cycles ( 98.2%)
> 10000000       1.86e+08 cycles (100.0%)    1.80e+08 cycles ( 96.8%)    1.81e+08 cycles ( 97.4%)
>
> Compiled with -DALIGNED, which uses the aligned-access intrinsics:
>
> Testing methods...
> All OK
>
> Problem size   Simple                      Intrin                      Inline
> 100            4.16e+02 cycles (100.0%)    1.96e+02 cycles ( 47.1%)    4.92e+02 cycles (118.3%)
> 1000           3.82e+03 cycles (100.0%)    1.56e+03 cycles ( 40.8%)    4.22e+03 cycles (110.4%)
> 10000          3.46e+04 cycles (100.0%)    1.92e+04 cycles ( 55.5%)    4.13e+04 cycles (119.4%)
> 100000         1.32e+06 cycles (100.0%)    1.12e+06 cycles ( 85.0%)    1.16e+06 cycles ( 87.8%)
> 1000000        1.95e+07 cycles (100.0%)    1.92e+07 cycles ( 98.3%)    1.95e+07 cycles (100.2%)
> 10000000       1.82e+08 cycles (100.0%)    1.79e+08 cycles ( 98.4%)    1.81e+08 cycles ( 99.3%)
>
> This gives a drastic difference (I did not touch the inline code,
> because it is Sunday and I am lazy). If I use this on a saner CPU (a
> Core 2 Duo MacBook) instead of my Pentium 4, I get better results: in
> particular, the SSE code is never slower, and I see roughly a 2x
> speedup as long as the buffer fits in cache.
>
> It looks like using prefetch also gives some improvement near the
> edge of the cache size (my P4 has a 512 KB L2 cache):
>
> Testing methods...
> All OK
>
> Problem size   Simple                      Intrin                      Inline
> 100            4.16e+02 cycles (100.0%)    2.52e+02 cycles ( 60.6%)    4.92e+02 cycles (118.3%)
> 1000           3.55e+03 cycles (100.0%)    1.85e+03 cycles ( 52.2%)    4.21e+03 cycles (118.7%)
> 10000          3.48e+04 cycles (100.0%)    1.76e+04 cycles ( 50.6%)    4.13e+04 cycles (118.9%)
> 100000         1.11e+06 cycles (100.0%)    7.20e+05 cycles ( 64.8%)    1.12e+06 cycles (101.3%)
> 1000000        1.91e+07 cycles (100.0%)    1.98e+07 cycles (103.4%)    1.91e+07 cycles (100.0%)
> 10000000       1.83e+08 cycles (100.0%)    1.90e+08 cycles (103.9%)    1.82e+08 cycles ( 99.3%)
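A rough sketch of the prefetch idea: _mm_prefetch hints to the CPU
that data a fixed distance ahead of the current position should be
pulled into cache before the loop reaches it. The distance here (64
floats ahead) is an illustrative guess, not the value used in
vec_bench.c:

```c
#include <xmmintrin.h>   /* SSE intrinsics, including _mm_prefetch */
#include <stddef.h>

/* z = x + y, four floats at a time, prefetching one cache region
 * ahead.  Unaligned loads are used so the sketch works with any
 * buffers; the prefetch distance is a tunable parameter. */
void add_prefetch(float *z, const float *x, const float *y, size_t n)
{
    for (size_t i = 0; i + 4 <= n; i += 4) {
        if (i + 64 < n) {
            _mm_prefetch((const char *)(x + i + 64), _MM_HINT_T0);
            _mm_prefetch((const char *)(y + i + 64), _MM_HINT_T0);
        }
        __m128 a = _mm_loadu_ps(x + i);
        __m128 b = _mm_loadu_ps(y + i);
        _mm_storeu_ps(z + i, _mm_add_ps(a, b));
    }
}
```

This matches David's numbers: prefetch helps most when the working set
hovers around the L2 size, and does nothing (or slightly hurts) once
the arrays are far larger than cache.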
>
> The code can be seen there:
>
> http://www.ar.media.kyoto-u.ac.jp/members/david/archives/t2/vec_bench.c
> http://www.ar.media.kyoto-u.ac.jp/members/david/archives/t2/Makefile
> http://www.ar.media.kyoto-u.ac.jp/members/david/archives/t2/cycle.h
>
> Another thing that I have not seen mentioned but may be worth
> pursuing is using SSE in element-wise operations: you can get
> extremely fast exp, sin, cos and friends using SSE. Those are much
> easier to include in numpy (but much harder to implement...). See
> for example:
>
> http://www.pixelglow.com/macstl/
>
> cheers,
>
> David
> _______________________________________________
> Numpy-discussion mailing list
> Numpy-discussion@scipy.org
> http://projects.scipy.org/mailman/listinfo/numpy-discussion
--
Scott M. Ransom Address: NRAO
Phone: (434) 296-0320 520 Edgemont Rd.
email: sransom@nrao.edu Charlottesville, VA 22903 USA
GPG Fingerprint: 06A9 9553 78BE 16DB 407B FFCA 9BFA B6FF FFD3 2989