[IPython-User] ipython parallel help
Wed Jun 27 15:44:38 CDT 2012
My longer jobs seem to fail after multiple hours with messages like
got stale result: f75e3ed7-1781-47db-9b08-6f77cea02166
EngineError(Engine 52 died while running task
However, I haven't been able to determine if the problem is with IPython or
with the cluster.
On Wed, Jun 27, 2012 at 4:03 PM, Bago <firstname.lastname@example.org> wrote:
> Hi all, I'm trying to debug an issue in my ipython parallel code that's
> really driving me nuts. I would appreciate any help you guys could
> offer. The issue is that I start my jobs on the ipengines, and some of
> them never seem to complete. I first saw this issue using 0.11 and
> upgraded to 0.13beta hoping it would just go away.
> I wrote a helper function to keep track of what's going on:
> def helper(seq, worker, *args, **kargs):
>     import socket
>     # seq is a sequence, so name the log after its first item
>     f = open("helper%03d.log" % seq[0], 'w')
>     f.write("starting helper on " + socket.gethostname() + "\n")
>     result = []
>     for i in seq:
>         f.write('starting iteration %03d\n' % i)
>         result.append(worker(i, *args, **kargs))
>         f.write('done with iteration %03d\n' % i)
>     f.write('loop complete\n')
>     f.close()
>     return result
> I've tried using the @parallel decorator with my helper function and
> I've tried calling the helper function using a scatter/execute setup, i.e.:
> scatter('small_seq', seq)
> execute('result = helper(small_seq, worker, ...)')
> result = gather('result')
> Either way my log file gets filled out as expected, but long after all
> the log files get to 'loop complete', the status of some of the jobs
> never becomes 'completed'.
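
One way to see which tasks are stuck is to poll the result objects and separate the finished ones from those still pending. This is only a minimal sketch: it assumes each result object exposes a `ready()` method, as IPython.parallel's `AsyncResult` does, and `split_by_status` and the `FakeResult` stand-in are hypothetical names used purely for illustration.

```python
def split_by_status(results):
    """Partition result objects into (done, pending) lists of indices.

    Assumes each object has a ready() method that returns True once the
    task has finished, as IPython.parallel's AsyncResult does.
    """
    done, pending = [], []
    for index, res in enumerate(results):
        (done if res.ready() else pending).append(index)
    return done, pending


class FakeResult(object):
    """Hypothetical stand-in for an AsyncResult, for demonstration only."""

    def __init__(self, finished):
        self._finished = finished

    def ready(self):
        return self._finished


# Three pretend tasks: the second one has not finished.
demo_done, demo_pending = split_by_status(
    [FakeResult(True), FakeResult(False), FakeResult(True)]
)
```

Polling this periodically and logging the pending indices would show whether the same tasks stay stuck or whether results simply arrive late.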
> The issue seems to only happen after the engines have been running for a
> while, about 3 hours, so it's making it really tricky to debug. Everything
> I've done to try and re-create the issue in less than 10 min has failed.
> I wanted to know if there is any way to interrogate the engines/controller
> after they've been running for 3 hours to try and figure out what the
> issue might be.
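
For inspecting a long-running cluster, one option is to connect a fresh client and ask the hub for its queue status. The sketch below only summarizes such a status dictionary. It assumes the layout returned by `Client.queue_status()` in IPython.parallel (each engine id mapped to a dict of `queue`, `tasks`, and `completed` counts, plus an `unassigned` entry); `busy_engines` and the sample data are hypothetical.

```python
def busy_engines(status):
    """Return sorted ids of engines that still report outstanding work.

    `status` is assumed to look like the dict from Client.queue_status():
    {engine_id: {'queue': n, 'tasks': n, 'completed': n}, 'unassigned': n}
    """
    busy = []
    for engine_id, counts in status.items():
        if not isinstance(counts, dict):
            continue  # skip the 'unassigned' total
        if counts.get('queue', 0) or counts.get('tasks', 0):
            busy.append(engine_id)
    return sorted(busy)


# Made-up sample data for demonstration, not real cluster output.
sample_status = {
    0: {'queue': 0, 'tasks': 0, 'completed': 12},
    52: {'queue': 1, 'tasks': 0, 'completed': 7},
    'unassigned': 0,
}
demo_busy = busy_engines(sample_status)
```

With a live cluster this would be called as `busy_engines(rc.queue_status())`, where `rc` is a connected `Client`. An engine that stays in the list long after its log shows 'loop complete' is a candidate for closer inspection.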
> Also something I've noticed, the issue is more likely to happen on
> machines that are far from the controller. Our machines are in two
> separate rooms and when I start the controller/hub on a machine in a
> given room all the machines in that room seem to do ok, but some of the
> machines in the other room exhibit this symptom. Has anyone seen
> anything like this? Any advice on how I can debug it?
> Thanks for all your help