[IPython-User] ipython parallel help

Robert Nishihara robertnishihara@gmail....
Wed Jun 27 15:44:38 CDT 2012


My longer jobs seem to fail after multiple hours with messages like

    got stale result: f75e3ed7-1781-47db-9b08-6f77cea02166
    EngineError(Engine 52 died while running task 'f75e3ed7-1781-47db-9b08-6f77cea02166')

However, I haven't been able to determine if the problem is with IPython or
with the cluster.

-Robert

On Wed, Jun 27, 2012 at 4:03 PM, Bago <mrbago@gmail.com> wrote:

> Hi all, I'm trying to debug an issue in my ipython parallel code that's
> really driving me nuts. I would appreciate any help you guys could
> offer. The issue is that I start my jobs on the ipengines, and some of
> them never seem to complete. I first saw this issue using 0.11 and
> upgraded to the 0.13 beta hoping it would just go away.
>
> I wrote a helper function to keep track of what's going on:
>
>     def helper(seq, worker, *args, **kargs):
>         import socket
>         f = open("helper%03d.log" % seq[0], 'w')
>         f.write("starting helper on " + socket.gethostname() + "\n")
>         result = []
>         for i in seq:
>             f.write('starting iteration %03d\n' % i)
>             f.flush()
>             result.append(worker(i, *args, **kargs))
>             f.write('done with iteration %03d\n' % i)
>             f.flush()
>         f.write('loop complete\n')
>         f.close()
>         return result
>
> I've tried using the @parallel decorator with my helper function and
> I've tried calling the helper function using a scatter/execute setup, i.e.:
>
>     scatter('small_seq', seq)
>     execute('result = helper(small_seq, worker, ...)')
>     result = gather('result')
>
> Either way my log files get filled out as expected, but long after all
> the log files get to 'loop complete', the status of some of the jobs
> never becomes 'completed'.
>
> The issue seems to only happen after the engines have been running for a
> while, about 3 hours, so it's really tricky to debug. Everything
> I've done to try and re-create the issue in less than 10 min has failed.
> I wanted to know if there is any way to interrogate the engines/controller
> after they've been running for 3 hours to try and figure out what the
> issue might be.
>
> Also something I've noticed, the issue is more likely to happen on
> machines that are far from the controller. Our machines are in two
> separate rooms and when I start the controller/hub on a machine in a
> given room all the machines in that room seem to do ok, but some of the
> machines in the other room exhibit this symptom. Has anyone seen
> anything like this? Any advice on how I can debug it?
>
> Thanks for all your help
> Bago
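
One way to narrow down where the time goes might be to timestamp every line the quoted helper writes, so the engine-side log can be lined up against the controller's log. The following is only a sketch of that idea, not a fix for the problem: `timed_helper` and its `log_dir` keyword are hypothetical names that do not appear in the original helper, and the imports stay inside the function body, as in the original, on the assumption that the function is shipped to the engines.

```python
def timed_helper(seq, worker, *args, **kwargs):
    """Variant of the quoted helper that timestamps every log line.

    `log_dir` is a hypothetical keyword argument (not part of the original
    helper). It is popped from kwargs before they are passed to `worker`.
    """
    # Imports live inside the function so it still works when the function
    # body is sent to a remote engine, as the original helper does.
    import os
    import socket
    import time

    log_dir = kwargs.pop("log_dir", ".")
    path = os.path.join(log_dir, "helper%03d.log" % seq[0])

    def stamp(f, message):
        # Prefix each message with wall-clock time and flush immediately,
        # so the file is current even if the engine dies mid-run.
        f.write("%s %s\n" % (time.strftime("%Y-%m-%d %H:%M:%S"), message))
        f.flush()

    result = []
    with open(path, "w") as f:
        stamp(f, "starting helper on " + socket.gethostname())
        for i in seq:
            start = time.time()
            stamp(f, "starting iteration %03d" % i)
            result.append(worker(i, *args, **kwargs))
            stamp(f, "done with iteration %03d (%.1f s)"
                  % (i, time.time() - start))
        stamp(f, "loop complete")
    return result
```

Called locally as `timed_helper([0, 1, 2], worker, log_dir="/tmp")`, it writes `helper000.log` with one timestamped line per event. The timestamp of 'loop complete' can then be compared with the time at which the controller reported the engine as dead or the result as stale.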