[SciPy-User] Multiprocessing and shared memory
Sun Oct 18 13:18:24 CDT 2009
I have been working on an application using scipy that solves a highly
parallel problem. To avoid the GIL in Python, I used the multiprocessing
package. The main issue I ran into is shared memory. All workers share
(read-only) access to a single large numpy array. This should be
simple, but implemented naively (passing the array as a function
argument to the worker process) will eventually create copies of the
whole array in memory (I guess because each worker updates the array's
reference count, triggering the UNIX copy-on-write mechanism).
To avoid this, I found two mechanisms:
1. Using multiprocessing.Array and passing it to numpy.frombuffer.
This has the disadvantage of messing with the ctypes-to-numpy
conversion and generally looks clumsy.
2. Using numpy.memmap.
This has the disadvantage that I need to create file descriptors, keep
track of them, and make sure that they are closed at the right moment
(when I tried to get it to work implicitly, I ran into memory leaks, I
think due to the files not being closed when worker processes exit).
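A minimal sketch of what I mean by option 2 (the file path is illustrative; each worker maps the same backing file read-only):

```python
import os
import tempfile
import numpy as np

# Parent writes the data to a backing file once
fname = os.path.join(tempfile.mkdtemp(), 'shared.dat')
np.arange(12, dtype=np.float64).tofile(fname)

def worker(fname, shape):
    # mode='r' maps the file read-only; pages are shared between processes
    mm = np.memmap(fname, dtype=np.float64, mode='r', shape=shape)
    total = float(mm.sum())
    # Drop the reference so the underlying descriptor can be closed promptly
    del mm
    return total

print(worker(fname, (3, 4)))  # 66.0
```

Here the bookkeeping is the problem: someone has to create, track, and eventually delete the backing file.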
Is there a third way, ideally passing a plain numpy.array and
ensuring that the workers never write to the object, so that I can
rely on simple Unix shared memory (copy-on-write)?
If not, which of the two ways above is preferred and are there some
tricks to make it work robustly?
On a somewhat unrelated note:
I read that parts of numpy internally use multithreading to avoid the
global interpreter lock. Which parts are those, and how is this triggered?
Specifically is there a way to run numerical expressions on large
arrays in parallel (each thread working on a part of the array)? I am
doing things like
exp(special.gammaln(arr1 * x) - arr2)
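A self-contained version of that expression, so it is clear what I would like to parallelize (the array sizes and values are illustrative):

```python
import numpy as np
from scipy import special

arr1 = np.linspace(1.0, 5.0, 1_000_000)
arr2 = np.linspace(0.0, 1.0, 1_000_000)
x = 2.0

# Currently each ufunc call runs serially over the whole array; ideally
# the evaluation could be split across threads over chunks of the array
result = np.exp(special.gammaln(arr1 * x) - arr2)
print(result.shape)
```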