[Numpy-discussion] Fast threading solution thoughts

Francesc Alted faltet@pytables....
Thu Feb 12 04:05:54 CST 2009


Hi Brian,

On Thursday 12 February 2009, Brian Granger wrote:
> Hi,
>
> This is relevant for anyone who would like to speed up array based
> codes using threads.
>
> I have a simple loop that I have implemented using Cython:
>
> def backstep(np.ndarray opti, np.ndarray optf,
>              int istart, int iend, double p, double q):
>     cdef int j
>     cdef double *pi
>     cdef double *pf
>     pi = <double *>opti.data
>     pf = <double *>optf.data
>
>     with nogil:
>         for j in range(istart, iend):
>             pf[j] = (p*pi[j+1] + q*pi[j])
>
> I need to call this function *many* times, and each call cannot be
> performed until the previous one has completed, as there are data
> dependencies.  But, I still want to parallelize a single call to this
> function across multiple cores (notice that I am releasing the GIL
> before I do the heavy lifting).
>
> I want to break my loop range(istart,iend) into pieces and have a
> thread do each piece.  The arrays have sizes 10^3 to 10^5.
>
> Things I have tried:
[clip]

If your problem is evaluating vector expressions just like the above 
(i.e. without transcendental functions like sin, exp, etc.), the 
bottleneck is usually memory access, so using several threads is simply 
not going to help you achieve better performance; rather the contrary, 
since you also have to pay the additional thread overhead.  So, 
frankly, I would not waste more time trying to parallelize that.
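Just to illustrate what such a split would look like (this is a plain-Python sketch of the idea, not Brian's Cython code; the function names and the use of concurrent.futures are my own choices), one could chunk the index range and hand each chunk to a thread.  The result is correct, but for a pure add/multiply kernel like this the threads all compete for the same memory bus:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def backstep_serial(pi, p, q):
    # pf[j] = p*pi[j+1] + q*pi[j] for j = 0 .. len(pi)-2
    return p * pi[1:] + q * pi[:-1]

def backstep_threaded(pi, p, q, nthreads=2):
    # Split range(0, n) into nthreads contiguous chunks and
    # let each thread fill its own slice of the output array.
    n = len(pi) - 1
    pf = np.empty(n)
    bounds = np.linspace(0, n, nthreads + 1).astype(int)

    def work(lo, hi):
        pf[lo:hi] = p * pi[lo + 1:hi + 1] + q * pi[lo:hi]

    with ThreadPoolExecutor(nthreads) as ex:
        # Force completion of all chunks before returning.
        list(ex.map(lambda b: work(*b), zip(bounds[:-1], bounds[1:])))
    return pf
```

Both versions compute the same values; timing them on arrays of 10^3 to 10^5 elements is a quick way to see that the threaded version does not win for memory-bound expressions.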

As an example, in the recent VML support in numexpr we have disabled 
the use of VML (as well as the OpenMP threading support that comes with 
it) in cases like yours, where only additions and multiplications are 
performed (these operations are very fast on modern processors, and the 
sole bottleneck in this case is memory bandwidth, as I've said).  
However, for expressions containing operations like division or 
transcendental functions, VML activates automatically, and you can 
make use of several cores if you want.  So, if you are in this case 
and have access to Intel MKL (the library that contains VML), you 
may want to give numexpr a try.
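For reference, a minimal numexpr call looks like the following (assuming numexpr is installed; the helper function and the NumPy fallback are my own additions for illustration):

```python
import numpy as np
try:
    # Optional dependency; a VML-enabled build (Intel MKL) can use
    # several cores for division/transcendental expressions.
    import numexpr as ne
except ImportError:
    ne = None

def weighted_sum(a, b, p, q):
    """Evaluate p*a + q*b.  This expression is memory-bound, so
    numexpr deliberately skips VML/threading for it; something like
    ne.evaluate("exp(a) / b") is where the extra cores pay off."""
    if ne is not None:
        # numexpr picks up a, b, p, q from the caller's locals.
        return ne.evaluate("p*a + q*b")
    return p * a + q * b  # plain NumPy fallback
```

The expression string is compiled once and evaluated blockwise over the arrays, which also avoids the temporaries that plain NumPy would allocate for each intermediate result.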

HTH,

-- 
Francesc Alted

