[Numpy-discussion] NumPy re-factoring project
Francesc Alted
faltet@pytables....
Fri Jun 11 03:08:40 CDT 2010
On Friday 11 June 2010 02:27:18 Sturla Molden wrote:
> >> Another thing I did when reimplementing lfilter was "copy-in copy-out"
> >> for strided arrays.
> >
> > What is copy-in copy-out? I am not familiar with this term.
>
> Strided memory access is slow. So it often helps to make a temporary
> copy that is contiguous.
In my experience, this technique only works well with strided arrays if you
are going to re-use the data of these temporaries while it is still in cache,
or if your data is unaligned. But if you are going to use the data only once
(and this is very common in NumPy element-wise operations), it is rather
counterproductive for strided arrays.
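A minimal sketch of the two strategies in plain NumPy may make the distinction
concrete. The function names `add_direct` and `add_copy_in_out` are hypothetical
and chosen only for illustration; this is not how lfilter or numexpr implement
either technique.

```python
import numpy as np

def add_direct(a, b):
    """Direct access: the ufunc walks the strided inputs itself."""
    return a + b

def add_copy_in_out(a, b, out):
    """Copy-in copy-out: compute on contiguous temporaries, then copy back."""
    a_tmp = np.ascontiguousarray(a)   # copy-in (no copy if already contiguous)
    b_tmp = np.ascontiguousarray(b)
    res = a_tmp + b_tmp               # the kernel only sees contiguous memory
    out[...] = res                    # copy-out into the strided destination
    return out

base = np.arange(20, dtype=np.float64)
a = base[::2]                # strided view: every other element
b = base[1::2]
out = np.empty(40)[::4]      # strided destination with 10 elements
```

Both functions return the same values; the difference is only in how many
passes over memory are made, which is why the copies pay off mainly when the
temporaries are re-used.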
For example, in numexpr, we ran many tests comparing the "copy-in copy-out"
and direct access techniques for strided arrays. The result was that
operations with direct access showed significantly better performance with
strided arrays. In contrast, for unaligned arrays the copy-in copy-out
technique gave better results.
Look at these times, where the arrays were unidimensional with a length of 1
million elements each, but the results can be extrapolated to larger,
multidimensional arrays (the original benchmark file is bench/vml_timing.py):
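For readers who want to reproduce the three cases, here is a rough sketch of
how contiguous, strided and unaligned arrays can be built and timed with NumPy
alone. It is an assumption about the setup, not the contents of
bench/vml_timing.py; `make_arrays` and `time_expr` are hypothetical helper
names.

```python
import timeit
import numpy as np

def make_arrays(n):
    """Return contiguous, strided and unaligned float64 arrays of length n."""
    contiguous = np.arange(n, dtype=np.float64)

    # Strided: every other element of a buffer twice as long.
    strided = np.empty(2 * n, dtype=np.float64)[::2]
    strided[:] = contiguous

    # Unaligned: view a byte buffer starting at an address that is not a
    # multiple of the float64 alignment; the items stay contiguous.
    align = np.dtype(np.float64).alignment
    raw = np.zeros(8 * n + 8, dtype=np.uint8)
    offset = 1 if raw.ctypes.data % align == 0 else 0
    unaligned = raw[offset:offset + 8 * n].view(np.float64)
    unaligned[:] = contiguous
    return contiguous, strided, unaligned

def time_expr(func, arr, repeat=3, number=10):
    """Best time in seconds for one call of func(arr)."""
    best = min(timeit.repeat(lambda: func(arr), repeat=repeat, number=number))
    return best / number

if __name__ == "__main__":
    c, s, u = make_arrays(1000 * 1000)
    for name, arr in [("numpy", c), ("numpy strided", s), ("numpy unaligned", u)]:
        print("%s: %.4f" % (name, time_expr(lambda x: x + x, arr)))
```

The absolute numbers will of course depend on the machine and on the NumPy
version in use.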
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Numexpr version: 1.3.2.dev169
NumPy version: 1.4.1rc2
Python version: 2.6.1 (r261:67515, Feb 3 2009, 17:34:37)
[GCC 4.3.2 [gcc-4_3-branch revision 141291]]
Platform: linux2-x86_64
AMD/Intel CPU? True
VML available? True
VML/MKL version: Intel(R) Math Kernel Library Version 10.1.0 Product Build
081809.14 for Intel(R) 64 architecture applications
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
To start with, times between numpy and numexpr are very similar for very
simple expressions (except for unaligned arrays, where "copy-in copy-out"
works pretty well for numexpr):
******************* Expression: i2 > 0
numpy: 0.0016
numpy strided: 0.0037
numpy unaligned: 0.0086
numexpr: 0.0016 Speed-up of numexpr over numpy: 0.9512
numexpr strided: 0.0039 Speed-up of numexpr over numpy: 0.964
numexpr unaligned: 0.0042 Speed-up of numexpr over numpy: 2.0598
When doing some basic operations (note that there are no temporaries here, so
numpy should not be at a great disadvantage), direct access to strided data in
numexpr is between 2x and 3x faster than numpy:
******************* Expression: f3+f4
numpy: 0.0060
numpy strided: 0.0176
numpy unaligned: 0.0166
numexpr: 0.0052 Speed-up of numexpr over numpy: 1.1609
numexpr strided: 0.0086 Speed-up of numexpr over numpy: 2.0584
numexpr unaligned: 0.0099 Speed-up of numexpr over numpy: 1.6785
******************* Expression: f3+i2
numpy: 0.0060
numpy strided: 0.0176
numpy unaligned: 0.0176
numexpr: 0.0031 Speed-up of numexpr over numpy: 1.9137
numexpr strided: 0.0061 Speed-up of numexpr over numpy: 2.8789
numexpr unaligned: 0.0078 Speed-up of numexpr over numpy: 2.2411
Notice how, so far, the absolute times for numexpr with strided arrays (using
the direct technique) are better than for the unaligned case (copy-in
copy-out).
Also, when evaluating transcendental expressions (numexpr uses Intel's Vector
Math Library, VML, here), direct access is again faster than NumPy:
******************* Expression: exp(f3)
numpy: 0.0150
numpy strided: 0.0155
numpy unaligned: 0.0222
numexpr: 0.0030 Speed-up of numexpr over numpy: 5.0268
numexpr strided: 0.0081 Speed-up of numexpr over numpy: 1.9086
numexpr unaligned: 0.0066 Speed-up of numexpr over numpy: 3.3454
******************* Expression: log(exp(f3)+1)/f4
numpy: 0.0486
numpy strided: 0.0563
numpy unaligned: 0.0639
numexpr: 0.0121 Speed-up of numexpr over numpy: 4.0332
numexpr strided: 0.0170 Speed-up of numexpr over numpy: 3.3067
numexpr unaligned: 0.0164 Speed-up of numexpr over numpy: 3.8833
However, now that I look at the latter figures, I don't remember us having
checked whether a copy-in copy-out technique would be faster in combination
with VML. Given the better absolute times for unaligned arrays, I'd say
chances are that performance in the strided scenario *might* benefit from
using copy-in/copy-out. Mmh, that's worth a try...
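As a rough sketch of what such an experiment could look like, the function
below gathers a strided input block by block into a small contiguous buffer
and runs the kernel on that buffer. Here `np.exp` merely stands in for a VML
call, and the function name and the block size of 4096 are arbitrary
assumptions; this is not numexpr's actual code path.

```python
import numpy as np

def exp_blocked_copy_in_out(x, block=4096):
    """exp() of a 1-D (possibly strided) array via contiguous blocks."""
    n_total = x.shape[0]
    out = np.empty(n_total, dtype=np.float64)
    buf = np.empty(block, dtype=np.float64)    # re-used contiguous scratch buffer
    for start in range(0, n_total, block):
        stop = min(start + block, n_total)
        n = stop - start
        buf[:n] = x[start:stop]                # copy-in: gather the strided block
        np.exp(buf[:n], out=out[start:stop])   # kernel runs on contiguous data
    return out
```

Because the output here is already contiguous, no separate copy-out step is
needed; with a strided destination, each block result would have to be
scattered back as well. Keeping the buffer small lets it stay in cache between
the copy and the kernel call.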
--
Francesc Alted