[IPython-User] ipython parallel help
Wed Jun 27 15:03:07 CDT 2012
Hi all, I'm trying to debug an issue in my ipython parallel code that's
really driving me nuts. I would appreciate any help you guys could
offer. The issue is that I start my jobs on the ipengines, and some of
them never seem to complete. I first saw this issue using 0.11 and
upgraded to 0.13beta hoping it would just go away.
I wrote a helper function to keep track of what's going on:
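    def helper(seq, worker, *args, **kargs):
        import socket  # imported here so the name exists on the engine
        # "%03d" needs an int; naming the log after seq[0] is an assumption
        with open("helper%03d.log" % seq[0], 'w') as f:
            f.write("starting helper on " + socket.gethostname() + "\n")
            result = []
            for i in seq:
                f.write('starting iteration %03d\n' % i)
                result.append(worker(i, *args, **kargs))
                f.write('done with iteration %03d\n' % i)
            f.write('loop complete\n')
        return result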
I've tried using the @parallel decorator with my helper function, and
I've tried calling the helper function using a scatter/execute setup, i.e.:
    execute('result = helper(small_seq, worker, ...)')
    result = gather('result')
Either way my log files get filled out as expected, but long after all
the log files get to 'loop complete', the status of some of the jobs
never becomes 'completed'.
The issue only seems to appear after the engines have been running for a
while, about 3 hours, which makes it really tricky to debug. Everything
I've done to try to re-create the issue in less than 10 min has failed.
I wanted to know if there is any way to interrogate the engines/controller
after they've been running for 3 hours to try and figure out what the
issue might be.
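
For concreteness, here is a minimal sketch of the kind of client-side check this could involve once jobs look stuck. It assumes `client` is an IPython.parallel Client and `ar` is the AsyncResult of the call that never completes. `find_stuck` is a hypothetical name, and whether this output would reveal the cause is untested.

```python
def find_stuck(client, ar):
    """Return (queue_status, pending_msg_ids) for one AsyncResult.

    Assumes `client` behaves like an IPython.parallel Client and `ar`
    like an AsyncResult it produced; `find_stuck` is a hypothetical helper.
    """
    # Per-engine queue information as reported by the hub.
    status = client.queue_status()
    # msg_ids from this result that the client is still waiting on.
    pending = [m for m in ar.msg_ids if m in client.outstanding]
    return status, pending
```

Comparing the pending msg_ids against the helper logs would show whether a job finished on the engine while the client still considers it outstanding.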
Something else I've noticed: the issue is more likely to happen on
machines that are far from the controller. Our machines are in two
separate rooms, and when I start the controller/hub on a machine in a
given room, all the machines in that room seem to do OK, but some of the
machines in the other room exhibit this symptom. Has anyone seen
anything like this? Any advice on how I can debug it?
Thanks for all your help