[Numpy-discussion] numpy.random and multiprocessing
Bruce Southey
bsouthey@gmail....
Thu Dec 11 12:00:23 CST 2008
David Cournapeau wrote:
> Sturla Molden wrote:
>
>> On 12/11/2008 6:10 PM, Michael Gilbert wrote:
>>
>>
>>
>>> Shouldn't numpy (and/or multiprocessing) be smart enough to prevent
>>> this kind of error? A simple enough solution would be to also include
>>> the process id as part of the seed
>>>
>>>
>> It would not help, as the seeding is done prior to forking.
>>
>> I am mostly familiar with Windows programming. But what is needed is a
>> fork handler (similar to a system hook in Windows jargon) that sets a
>> new seed in the child process.
>>
>> Could pthread_atfork be used?
>>
>>
>
> The seed could be explicitly set in each task, no ?
>
> import numpy as np
>
> def task(x):
>     # Reseed the global generator inside each task
>     np.random.seed()
>     return np.random.random(x)
>
> But does this really make sense ?
>
> Is the goal to parallelize a big sampler into N tasks of M trials, to
> produce the same result as a sequential set of M*N trials ? Then it does
> not sound like a trivial task at all. I know there exist libraries
> explicitly designed for parallel random number generation - maybe this
> is where we should look, instead of using heuristics which are likely to
> be bogus, and generate wrong results.
>
> cheers,
>
> David
>
This is not sufficient, because you cannot ensure that the seed will be
different every time task() is called.

A major part of the problem here is treating a parallel computing
problem as a serial computing problem. The streams must be independent
across threads; in particular, cross-correlation between the streams of
different threads (another gotcha) must be avoided. It is up to the user
to implement a thread-safe solution, such as sharing a single stream
among all threads or forcing the different threads to start at different
states. The only thing that NumPy could do is provide a parallel
pseudo-random number generator.
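
A minimal sketch of the "start at different states" option, assuming a
NumPy release that provides np.random.SeedSequence and
np.random.default_rng (1.17 or later, which postdates this thread). The
helper names make_child_seeds and task, the root seed 12345, and the
worker count are illustrative assumptions, not part of the original
discussion:

```python
import numpy as np


def make_child_seeds(root_seed, n_workers):
    # Derive one child SeedSequence per worker from a single root seed.
    # spawn() is designed so that the children seed non-overlapping,
    # statistically independent streams.
    return np.random.SeedSequence(root_seed).spawn(n_workers)


def task(child_seed, n_samples):
    # Each task builds its own generator from its own child seed,
    # so no state is shared with (or inherited from) another worker.
    rng = np.random.default_rng(child_seed)
    return rng.random(n_samples)


# Illustrative values: 4 workers, 3 samples each.
child_seeds = make_child_seeds(root_seed=12345, n_workers=4)
results = [task(seed, 3) for seed in child_seeds]
```

The loop stands in for the parallel part: each child seed could instead
be handed to a separate worker process or thread. Because every stream
is derived from the same root seed, the whole run stays reproducible.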
Bruce
