[IPython-User] ipython parallel help

Bago mrbago@gmail....
Wed Jun 27 15:03:07 CDT 2012


Hi all, I'm trying to debug an issue in my ipython parallel code that's 
really driving me nuts. I would appreciate any help you guys could 
offer. The issue is that I start my jobs on the ipengines, and some of 
them never seem to complete. I first saw this issue using 0.11 and 
upgraded to 0.13beta hoping it would just go away.

I wrote a helper function to keep track of what's going on:

     def helper(seq, worker, *args, **kargs):
         import socket
         f = open("helper%03d.log" % seq[0], 'w')
         f.write("starting helper on " + socket.gethostname() + "\n")
         result = []
         for i in seq:
             f.write('starting iteration %03d\n' % i)
             f.flush()
             result.append(worker(i, *args, **kargs))
             f.write('done with iteration %03d\n' % i)
             f.flush()
         f.write('loop complete\n')
         f.close()
         return result

I've tried using the @parallel decorator with my helper function and 
I've tried calling the helper function using a scatter/execute setup, i.e.:

scatter('small_seq', seq)
execute('result = helper(small_seq, worker, ...)')
result = gather('result')

Either way my log files get filled out as expected, but long after all 
the log files get to 'loop complete', the status of some of the jobs 
never becomes 'completed'.
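
One thing that might help narrow this down is putting a timestamp on 
every log line, so the time an engine writes 'loop complete' can be 
compared with the time the client sees (or never sees) the result. 
Below is only a sketch of that idea, not something I have run on the 
cluster: timed_helper and the log_dir keyword are names made up for 
illustration, and the imports stay inside the function on the 
assumption that the function gets shipped to the engines the same way 
as helper.

```python
def timed_helper(seq, worker, *args, **kwargs):
    """Variant of helper() whose log lines carry a wall-clock timestamp."""
    # Imports live inside the function, as in the original helper, so they
    # are available wherever the function body actually executes.
    import os
    import socket
    import time

    # log_dir is a hypothetical keyword (not in the original helper); it is
    # removed from kwargs so that it is not passed on to the worker.
    log_dir = kwargs.pop("log_dir", ".")
    path = os.path.join(log_dir, "helper%03d.log" % seq[0])

    result = []
    with open(path, "w") as f:
        def log(msg):
            # Flush after every line so the file is current if the job hangs.
            f.write("%s %s\n" % (time.strftime("%Y-%m-%d %H:%M:%S"), msg))
            f.flush()

        log("starting helper on %s (pid %d)" % (socket.gethostname(), os.getpid()))
        for i in seq:
            log("starting iteration %03d" % i)
            result.append(worker(i, *args, **kwargs))
            log("done with iteration %03d" % i)
        log("loop complete")
    return result
```

The timestamps are only comparable across machines if the clocks in 
both rooms are reasonably in sync.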

The issue seems to only happen after the engines have been running for a 
while, about 3 hours, so it's making it really tricky to debug. Everything 
I've done to try and re-create the issue in less than 10 minutes has failed. 
I wanted to know if there is any way to interrogate the engines/controller 
after they've been running for 3 hours to try and figure out what the 
issue might be.
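
To make the question concrete, this is roughly the kind of check I have 
in mind. It is a sketch under assumptions: rc is assumed to be an 
IPython.parallel Client connected to the running controller, and I am 
assuming that rc.queue_status() returns a dict keyed by engine id (plus 
an 'unassigned' entry) with 'queue' and 'tasks' counts, and that 
rc.outstanding is the set of msg_ids the client is still waiting on. 
Please check both against the installed version. summarize_cluster is 
just a made-up name.

```python
def summarize_cluster(rc):
    """Summarize which engines still hold work, as seen by the controller.

    rc is assumed to behave like IPython.parallel.Client: it must provide
    queue_status() and an 'outstanding' collection of msg_ids.
    """
    status = rc.queue_status()

    busy = {}
    for engine_id, info in status.items():
        # The 'unassigned' entry is a plain count, not a per-engine dict.
        if engine_id == "unassigned":
            continue
        pending = info.get("queue", 0) + info.get("tasks", 0)
        if pending:
            busy[engine_id] = pending

    return {
        "busy_engines": busy,                      # engine id -> pending jobs
        "unassigned": status.get("unassigned", 0),  # jobs not yet on an engine
        "outstanding": sorted(rc.outstanding),      # msg_ids the client awaits
    }
```

If the controller reports no busy engines while the client still lists 
outstanding msg_ids, that would suggest the results are getting lost 
somewhere between the engine and the client rather than the jobs being 
stuck on the engine.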

Something else I've noticed: the issue is more likely to happen on 
machines that are far from the controller. Our machines are in two 
separate rooms, and when I start the controller/hub on a machine in a 
given room all the machines in that room seem to do ok, but some of the 
machines in the other room exhibit this symptom. Has anyone seen 
anything like this, or have any advice on how I can debug it?

Thanks for all your help
Bago



