[SciPy-User] Parallel operations on the columns of a numpy array

François Bouffard fbouffard@gmail....
Fri Aug 13 14:51:38 CDT 2010

I chose Python and numpy for a project of mine which had to be
programmed really quickly, and this solution really shines in terms of
development time.

However, I encountered a speed bottleneck when trying to perform
"embarrassingly parallel" operations on the columns of a medium-sized
numpy array. More specifically, I need to perform fftshift() and fft()
on every column of this array, as quickly as possible. I cannot
zero-pad the array to the next power of two, so of course I use the
fft(x, None, 0) syntax.
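For reference, a minimal sketch of that per-column transform (assuming fftshift is applied before fft, and substituting a small random array for the real data):

```python
import numpy as np

# Small stand-in for the real data (typical size is 100-by-10,000).
x = np.random.rand(100, 50)

# fft(x, None, 0): no zero-padding (n=None), transform along axis 0,
# i.e. every column is shifted and transformed independently.
result = np.fft.fft(np.fft.fftshift(x, axes=0), None, 0)

# Equivalent column-by-column loop, just to show what the axis
# argument is doing.
loop = np.empty_like(result)
for j in range(x.shape[1]):
    loop[:, j] = np.fft.fft(np.fft.fftshift(x[:, j]))

assert np.allclose(result, loop)
```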

Comparing my results to a Matlab implementation, I quickly realized
that I suffered a factor-C penalty, where C is the number of CPUs
available on the machine I'm working on. Whereas in Matlab all CPUs
are used at 100%, only one (on average) is used in Python. This can be
a major drawback since the production machine is a dual quad-core
machine.

I tested numpy.fft.fft and scipy.fftpack.fft and quickly found an
important speed difference in favor of fftpack; however, multiprocessor
support is still not there. I understand that FFTW is no longer
distributed with numpy, and using a wrapper for that library may be an
option.
I also understand that scipy and numpy must be built against ATLAS,
LAPACK or Intel's MKL to maximize the performance of matrix operations.
I have not come around to building numpy with ATLAS support yet, but I
have installed Christoph Gohlke's MKL-enabled packages. With these,
multiprocessor support seems to be active for matrix operations such
as dot(); however, it does not seem to help with splitting a common
operation across the columns of an array.
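A quick way to check which backend a given build actually links against (this only reports the build configuration, and the BLAS/LAPACK libraries it lists do not accelerate the FFT routines):

```python
import numpy as np

# Prints the BLAS/LAPACK libraries (ATLAS, MKL, ...) that this numpy
# build was compiled against. The FFT code path is separate and is
# not routed through these libraries.
np.show_config()
```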

Another attempt at a speed-up consisted in using the parallel_map
function found in the handythread module graciously provided in the
Multithreading Cookbook on scipy.org. I simply split the array into as
many blocks as there are processors and feed those to parallel_map.
While it seems that more than one processor is used on average, the
speed gain is barely above negligible and still quite far from Matlab's
performance. Maybe the overhead of this method is killing me for
arrays that are not that large (typical size is 100-by-10,000).
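For what it's worth, that block-splitting approach can be sketched with a standard-library thread pool instead of handythread (fft_block and parallel_column_fft are made-up names for this sketch; whether the threads actually run in parallel depends on the FFT routine releasing the GIL, which may be exactly why the gain is negligible):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def fft_block(block):
    # fftshift + fft down each column of one block (axis 0).
    return np.fft.fft(np.fft.fftshift(block, axes=0), None, 0)

def parallel_column_fft(x, n_workers=4):
    # Split into as many column-blocks as there are workers,
    # transform each block in a thread, then glue the pieces back.
    blocks = np.array_split(x, n_workers, axis=1)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(fft_block, blocks))
    return np.hstack(results)

x = np.random.rand(100, 200)
y = parallel_column_fft(x)
```

Since fftshift and fft act along axis 0 only, transforming column blocks independently gives bit-for-bit the same result as transforming the whole array at once.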

So I'm just looking for any hints... Would FFTW help? Would ATLAS
offer better performance in that regard than Intel's MKL? Is there a
better way to parallelize column operations?

Thanks for any idea (and sorry for this rather long post).
