My longer jobs seem to fail after multiple hours with messages like<div><br></div><div><div> got stale result: f75e3ed7-1781-47db-9b08-6f77cea02166</div><div> EngineError(Engine 52 died while running task 'f75e3ed7-1781-47db-9b08-6f77cea02166')</div>
<div><br></div><div>However, I haven't been able to determine if the problem is with IPython or with the cluster.</div><div><br></div><div>-Robert</div><br><div class="gmail_quote">On Wed, Jun 27, 2012 at 4:03 PM, Bago <span dir="ltr">&lt;<a href="mailto:mrbago@gmail.com" target="_blank">mrbago@gmail.com</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi all, I'm trying to debug an issue in my ipython parallel code that's<br>
really driving me nuts. I would appreciate any help you guys could<br>
offer. The issue is that I start my jobs on the ipengines, and some of<br>
them never seem to complete. I first saw this issue using 0.11 and<br>
upgraded to 0.13 beta hoping it would just go away.<br>
<br>
I wrote a helper function to keep track of what's going on:<br>
<br>
def helper(seq, worker, *args, **kargs):<br>
    import socket<br>
    f = open("helper%03d.log" % seq[0], 'w')<br>
    f.write("starting helper on " + socket.gethostname() + "\n")<br>
    result = []<br>
    for i in seq:<br>
        f.write('starting iteration %03d\n' % i)<br>
        f.flush()<br>
        result.append(worker(i, *args, **kargs))<br>
        f.write('done with iteration %03d\n' % i)<br>
        f.flush()<br>
    f.write('loop complete\n')<br>
    f.close()<br>
    return result<br>
<br>
I've tried using the @parallel decorator with my helper function and<br>
I've tried calling the helper function using a scatter/execute setup, i.e.:<br>
<br>
scatter('small_seq', seq)<br>
execute('result = helper(small_seq, worker, ...)')<br>
result = gather('result')<br>
<br>
Either way my log file gets filled out as expected, but long after all<br>
the log files get to 'loop complete', the status of some of the jobs<br>
never becomes 'completed'.<br>
<br>
The issue seems to only happen after the engines have been running for a<br>
while, about 3 hours, so it's making it really tricky to debug. Everything<br>
I've done to try and re-create the issue in less than 10 min has failed.<br>
I wanted to know if there is any way to interrogate the engines/controller<br>
after they've been running for 3 hours to try and figure out what the<br>
issue might be.<br>
<br>
Also, I've noticed that the issue is more likely to happen on<br>
machines that are far from the controller. Our machines are in two<br>
separate rooms and when I start the controller/hub on a machine in a<br>
given room all the machines in that room seem to do ok, but some of the<br>
machines in the other room exhibit this symptom. Has anyone seen<br>
anything like this? Any advice on how I can debug it?<br>
<br>
Thanks for all your help<br>
Bago<br>
<br>
<br>
</blockquote></div><br></div>
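One way to narrow down whether the worker or the result delivery is at fault might be a slightly more defensive version of the helper from the quoted message. The following is only a sketch under assumptions: the timestamp format, the inner `log` function, and the exception logging are additions for illustration and were not in the original helper. The log file name and the overall loop are kept as posted.

```python
import socket
import time
import traceback


def helper(seq, worker, *args, **kwargs):
    """Run worker(i, ...) for each i in seq, logging progress to a file."""
    result = []
    # Same file name scheme as the original helper; 'with' guarantees the
    # file is closed even if the worker raises.
    with open("helper%03d.log" % seq[0], "w") as f:

        def log(msg):
            # Hypothetical addition: timestamp every line so the gap between
            # 'loop complete' and a lost result can be measured afterwards.
            f.write("%s %s\n" % (time.strftime("%Y-%m-%d %H:%M:%S"), msg))
            f.flush()

        log("starting helper on " + socket.gethostname())
        try:
            for i in seq:
                log("starting iteration %03d" % i)
                result.append(worker(i, *args, **kwargs))
                log("done with iteration %03d" % i)
        except Exception:
            # Record the traceback, then re-raise so the caller still sees it.
            log("worker raised:\n" + traceback.format_exc())
            raise
        log("loop complete")
    return result
```

If a log ends with a timestamped 'loop complete' but the task never shows up as completed, that would suggest the result was lost after the function returned rather than inside the worker itself. It would not, on its own, tell IPython and the network apart.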