[Numpy-discussion] Fast threading solution thoughts
Wed Feb 11 23:52:40 CST 2009
On Wed, Feb 11, 2009 at 23:46, Brian Granger <email@example.com> wrote:
> This is relevant for anyone who would like to speed up array based
> codes using threads.
> I have a simple loop that I have implemented using Cython:
> def backstep(np.ndarray opti, np.ndarray optf,
> int istart, int iend, double p, double q):
> cdef int j
> cdef double *pi
> cdef double *pf
> pi = <double *>opti.data
> pf = <double *>optf.data
> with nogil:
> for j in range(istart, iend):
> pf[j] = (p*pi[j+1] + q*pi[j])
> I need to call this function *many* times and each time cannot be
> performed until the previous time is completely as there are data
> dependencies. But, I still want to parallelize a single call to this
> function across multiple cores (notice that I am releasing the GIL
> before I do the heavy lifting).
> I want to break my loop range(istart,iend) into pieces and have a
> thread do each piece. The arrays have sizes 10^3 to 10^5.
> Things I have tried:
> * Use pthreads and create new threads for each call to my function.
> This performed horribly due to the thread creation overhead.
> * Use a simple threadpool implementation in Python. This performed
> horribly as well even though I was not recreating threads each call.
> The reason in this case was the time spent waiting on locks in the
> thread pool implementation (which is based on Queue and threading).
> This is either a problem with threading itself or a fundamental
> limitation of the pthreads library itself.
> * My next step is to implement a thread pool using pthreads/Cython.
> Assuming I do this right, this should be as fast as I can get using
> The only tools that I know of that are *really* designed to handle
> these types of fine-grained parallel problems are:
> * Intel's TBB
> * OpenMP
> * Cilk++
> * ???
> This seem like pretty heavy solutions though.
>From a programmer's perspective, it seems to me like OpenMP is a muck
lighter weight solution than pthreads.
> So, do people have thoughts about ways of effectively using threads
> (not processes) for thin parallel loops over arrays? This is relevant
> to Numpy itself as it would be very nice if all ufuncs could release
> the GIL and be run on multiple threads.
Eric Jones tried to do this with pthreads in C some time ago. His work is here:
The lock overhead makes it usually not worthwhile.
And there's still an open problem for determining whether two strided
arrays overlap in memory.
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco
More information about the Numpy-discussion