[Numpy-discussion] NumPy re-factoring project

Francesc Alted faltet@pytables....
Fri Jun 11 03:08:40 CDT 2010


On Friday 11 June 2010 02:27:18, Sturla Molden wrote:
> >> Another thing I did when reimplementing lfilter was "copy-in copy-out"
> >> for strided arrays.
> >
> > What is copy-in copy-out? I am not familiar with this term.
> 
> Strided memory access is slow. So it often helps to make a temporary
> copy that is contiguous.

In my experience, this technique only works well with strided arrays if 
you are going to reuse the data of these temporaries while it is still in 
cache, or if your data is unaligned.  But if you are going to use the data 
only once (which is very common in NumPy element-wise operations), it is 
rather counterproductive for strided arrays.
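To make the technique concrete, here is a minimal pure-NumPy sketch of copy-in copy-out (the block size and the in-place scaling operation are illustrative choices of mine, not lfilter's actual implementation):

```python
import numpy as np

def scale_copy_in_copy_out(a, factor, blocksize=4096):
    """Multiply a (possibly strided) 1-D array in place, block by block.

    Each block of the strided input is copied into a small contiguous
    buffer (copy-in), processed there, and copied back (copy-out).
    """
    buf = np.empty(blocksize, dtype=a.dtype)  # contiguous scratch buffer
    n = a.shape[0]
    for start in range(0, n, blocksize):
        stop = min(start + blocksize, n)
        m = stop - start
        buf[:m] = a[start:stop]   # copy-in: strided -> contiguous
        buf[:m] *= factor         # operate on contiguous memory
        a[start:stop] = buf[:m]   # copy-out: contiguous -> strided

base = np.arange(10.0)
view = base[::2]                  # strided view: every other element
scale_copy_in_copy_out(view, 10.0)
```

After the call, the even positions of `base` are scaled by 10 while the odd positions are untouched, since `view` shares memory with `base`.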

For example, in numexpr, we ran many different tests comparing the "copy-in 
copy-out" and direct access techniques for strided arrays.  The result was 
that operations with direct access showed significantly better performance 
with strided arrays.  Conversely, for unaligned arrays the copy-in copy-out 
technique gave better results.

Look at these times, where the arrays were one-dimensional with a length of 1 
million elements each; the results can be extrapolated to larger, 
multidimensional arrays (the original benchmark file is bench/vml_timing.py):

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=                 
Numexpr version:   1.3.2.dev169                                                              
NumPy version:     1.4.1rc2                                                                  
Python version:    2.6.1 (r261:67515, Feb  3 2009, 17:34:37)                                 
[GCC 4.3.2 [gcc-4_3-branch revision 141291]]                                                 
Platform:          linux2-x86_64                                                             
AMD/Intel CPU?     True                                                                      
VML available?     True                                                                      
VML/MKL version:   Intel(R) Math Kernel Library Version 10.1.0 Product Build 081809.14 for Intel(R) 64 architecture applications
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
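For reference, the "strided" and "unaligned" variants in a benchmark like this are typically plain views into a larger buffer.  Here is a sketch of one common way to build them (the exact setup in bench/vml_timing.py may differ):

```python
import numpy as np

n = 1_000_000

# Contiguous, aligned baseline array.
f = np.arange(n, dtype=np.float64)

# Strided: every other element of a buffer twice as long, so the
# element stride is 16 bytes instead of 8.
f_strided = np.arange(2 * n, dtype=np.float64)[::2]

# Unaligned: a float64 view starting 1 byte into a byte buffer, so the
# data pointer is not a multiple of the 8-byte itemsize.
raw = np.empty(n * 8 + 1, dtype=np.uint8)
f_unaligned = raw[1:].view(np.float64)
f_unaligned[:] = f
```

The `flags` attribute of the resulting arrays confirms the properties: `f_strided` is not C-contiguous, and `f_unaligned` has its `ALIGNED` flag cleared.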

To start with, times between numpy and numexpr are very similar for very 
simple expressions (except for unaligned arrays, where "copy-in copy-out" 
works pretty well for numexpr):

******************* Expression: i2 > 0                                                                                  
                        numpy: 0.0016                                                                                   
                numpy strided: 0.0037                                                                                   
              numpy unaligned: 0.0086                                                                                   
                      numexpr: 0.0016 Speed-up of numexpr over numpy: 0.9512                                            
              numexpr strided: 0.0039 Speed-up of numexpr over numpy: 0.964                                             
            numexpr unaligned: 0.0042 Speed-up of numexpr over numpy: 2.0598                                            

When doing some basic operations (mind that there are no temporaries here, so 
numpy should not be at a great disadvantage), direct access to strided data is 
between 2x and 3x faster than numpy:

******************* Expression: f3+f4
                        numpy: 0.0060
                numpy strided: 0.0176
              numpy unaligned: 0.0166
                      numexpr: 0.0052 Speed-up of numexpr over numpy: 1.1609
              numexpr strided: 0.0086 Speed-up of numexpr over numpy: 2.0584
            numexpr unaligned: 0.0099 Speed-up of numexpr over numpy: 1.6785

******************* Expression: f3+i2
                        numpy: 0.0060
                numpy strided: 0.0176
              numpy unaligned: 0.0176
                      numexpr: 0.0031 Speed-up of numexpr over numpy: 1.9137
              numexpr strided: 0.0061 Speed-up of numexpr over numpy: 2.8789
            numexpr unaligned: 0.0078 Speed-up of numexpr over numpy: 2.2411

Notice how, so far, the absolute times for numexpr with strided arrays (using 
the direct technique) are better than for the unaligned case (copy-in copy-out).

Also, when evaluating transcendental expressions (numexpr uses Intel's Vector 
Math Library, VML, here), direct access is again faster than NumPy:

******************* Expression: exp(f3)
                        numpy: 0.0150  
                numpy strided: 0.0155  
              numpy unaligned: 0.0222  
                      numexpr: 0.0030 Speed-up of numexpr over numpy: 5.0268
              numexpr strided: 0.0081 Speed-up of numexpr over numpy: 1.9086
            numexpr unaligned: 0.0066 Speed-up of numexpr over numpy: 3.3454

******************* Expression: log(exp(f3)+1)/f4
                        numpy: 0.0486            
                numpy strided: 0.0563            
              numpy unaligned: 0.0639            
                      numexpr: 0.0121 Speed-up of numexpr over numpy: 4.0332
              numexpr strided: 0.0170 Speed-up of numexpr over numpy: 3.3067
            numexpr unaligned: 0.0164 Speed-up of numexpr over numpy: 3.8833

However, now that I look at the latter figures, I don't remember whether we 
checked if a copy-in copy-out technique would be faster in combination with 
VML.  Judging by the better absolute times for unaligned arrays, I'd say 
chances are that performance in the strided scenario *might* benefit from 
using copy-in/copy-out.  Mmh, that's worth a try...
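A quick way to try that idea with plain NumPy (this sketch uses np.exp standing in for a VML call, and does not reproduce numexpr's internal blocking):

```python
import timeit
import numpy as np

# Strided view: 1 million float64 elements with a 16-byte stride.
base = np.linspace(0.0, 1.0, 2_000_000)
a = base[::2]
out = np.empty(a.shape[0])

def direct():
    np.exp(a, out=out)            # exp() walks the strided input directly

def copy_in_copy_out():
    tmp = np.ascontiguousarray(a)  # copy-in: contiguous temporary
    np.exp(tmp, out=tmp)           # operate on contiguous memory
    out[:] = tmp                   # copy-out into the result

t_direct = timeit.timeit(direct, number=20)
t_cico = timeit.timeit(copy_in_copy_out, number=20)
print(f"direct: {t_direct:.4f} s   copy-in/copy-out: {t_cico:.4f} s")
```

Which strategy wins depends on the cost of the inner operation relative to the copies: the heavier the transcendental function, the more the extra copies can pay for themselves through faster vectorized access.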

-- 
Francesc Alted