[Numpy-discussion] numpy.random and multiprocessing
Gael Varoquaux
gael.varoquaux@normalesup....
Thu Dec 11 09:20:49 CST 2008
Hi there,
I have been using the multiprocessing module a lot to do statistical tests
such as Monte Carlo or resampling, and I have just discovered something
that makes me wonder if I haven't been accumulating false results. Given
two files:
=== test.py ===
from test_helper import task
from multiprocessing import Pool
p = Pool(4)
jobs = list()
for i in range(4):
    jobs.append(p.apply_async(task, (4, )))
print([j.get() for j in jobs])
p.close()
p.join()
=== test_helper.py ===
import numpy as np
def task(x):
    return np.random.random(x)
=======
If I run test.py, I get:
[array([ 0.35773964, 0.63945684, 0.50855196, 0.08631373]), array([
0.35773964, 0.63945684, 0.50855196, 0.08631373]), array([ 0.35773964,
0.63945684, 0.50855196, 0.08631373]), array([ 0.65357725, 0.35649382,
0.02203999, 0.7591353 ])]
In other words, the worker processes give me exactly the same results:
three of the four arrays above are identical.
Now I understand why this is the case: the different instances of the
random number generator were created by forking from the same process,
so they all start from exactly the same state. This is however a fairly
bad trap. I guess other people will fall into it.
The take home message is:
**call 'numpy.random.seed()' in each worker process when you are using multiprocessing**
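One way to apply this is sketched below, assuming the workers are created by a multiprocessing Pool as in test.py. The function name `reseed` is made up for this example; the `initializer` argument of Pool runs it once in every worker right after the worker starts.

```python
import numpy as np
from multiprocessing import Pool


def reseed():
    # Hypothetical helper: runs once in each worker process.
    # Called without an argument, numpy.random.seed() reseeds the global
    # generator from fresh entropy instead of keeping the inherited state.
    np.random.seed()


def task(x):
    return np.random.random(x)


if __name__ == '__main__':
    # Each of the 4 workers calls reseed() before it handles any job.
    p = Pool(4, initializer=reseed)
    jobs = [p.apply_async(task, (4,)) for i in range(4)]
    print([j.get() for j in jobs])
    p.close()
    p.join()
```

Calling np.random.seed() at the top of `task` itself would also work, at the cost of reseeding on every job rather than once per worker.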
I wonder if we can find a way to make this more user friendly? Would it
be easy, in the C code, to check if the PID has changed, and if so reseed
the random number generator? I can open up a ticket for this if people
think this is desirable (I think so).
On a side note, there are a score of functions in numpy.random with
__module__ set to None. This makes them inconvenient to use with
multiprocessing (for instance it forced the creation of the 'test_helper'
file here).
Gaël