[IPython-User] Making an ipython cluster more stable

MinRK benjaminrk@gmail....
Fri Feb 3 15:33:42 CST 2012


On Fri, Feb 3, 2012 at 13:18, Caius Howcroft <caius.howcroft@gmail.com>wrote:

> Hi MinRK
>
> (sorry reopening an old thread)
>
> I saw on github you guys made some changes to ZMQStreams (
> https://github.com/ipython/ipython/issues/1304 ) to try to fix these
> heartbeat issues so I thought "great", and installed the latest from
> github master branch.
>
> However, I'm still seeing the exact same heartbeat warnings, followed
> by the whole cluster coming down. I tried upping Heartbeat.period
> to 10000; however, this caused the cluster to never get going... see
> the printout below.
>

Unlike the two-process heartbeat used in the qtconsole and notebook, the
parallel code does not consider an engine alive until it has responded to
its first heartbeat.  That means setting a long heartbeat directly affects
the amount of time it takes for an engine to become available (always
greater than one heartbeat, and ~always less than two).
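For reference, a sketch of what that tuning looks like in the controller's
config file; the exact trait name may differ between versions (this thread
mentions both Heartbeat.period and HeartbeatMonitor.period), and the value
is in milliseconds:

```python
# ipcontroller_config.py -- sketch only, using the name from this thread.
# 10000 ms = 10 s; engines that miss this window are treated as dead,
# and a new engine is not available until its first beat, so a larger
# period also means slower engine startup.
c = get_config()
c.HeartbeatMonitor.period = 10000
```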

How long does it take your cluster to come down?  How many engines are you
using? What sort of interconnect / env?  What Python, libzmq, and pyzmq
versions?  How long do your individual tasks take?  How much data are you
moving in your tasks?

I'm perpetually baffled by this issue, because I've been keeping clusters
up for days on end, sending 100s of GB of data, and I've never once seen
this happen.


>
> 2012-02-03 21:10:59.565 [IPClusterStart] Starting SSHEngineLauncher:
> ['ssh', '-tt', '-o', 'StrictHostKeyChecking no',
> u'ip-10-80-186-44.ec2.internal', '/usr/bin/python',
>
> u'/usr/local/lib/python2.7/dist-packages/ipython-0.12-py2.7.egg/IPython/parallel/apps/ipengineapp.py',
> '--profile-dir', u'/home/chowcroft/.ipython/profile_default',
> '--log-level=20']
> 2012-02-03 21:10:59.639 [IPClusterStart] Process 'ssh' started: 6097
> 2012-02-03 21:10:59.641 [IPClusterStart] Process 'engine set' started:
> [None, None, None, None, None, None, None, None, None, None, None,
> None, None, None, None, None, None, None, None, None, None]
> 2012-02-03 21:10:59.645 [IPClusterStart] Start the IPython controller
> for parallel computing.
> 2012-02-03 21:10:59.646 [IPClusterStart] Connection to
> ip-10-114-41-115.ec2.internal closed.
> 2012-02-03 21:10:59.646 [IPClusterStart] Process 'ssh' stopped:
> {'pid': 6042, 'exit_code': 1}
> 2012-02-03 21:10:59.646 [IPClusterStart] IPython cluster: stopping
>
>
> Not quite sure what to do... is there something else that's timing out?
>

The first step when debugging process startup is always to stop using
ipcluster, and instead replicate what ipcluster does (call ipcontroller
once, and ipengine one or more times) with `--debug` flags.  IPCluster
itself is a terrific pain to debug.
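Concretely, that manual replication might look something like this (the
profile name is a placeholder; run each command in its own terminal):

```shell
# On the controller host:
ipcontroller --profile=default --debug

# Then on each engine host, once per engine:
ipengine --profile=default --debug
```

Each process then logs directly to its own terminal, which usually makes
the failing step obvious.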

-MinRK


> Many thanks
>
> Caius
>
>
> On Wed, Jan 11, 2012 at 5:54 PM, MinRK <benjaminrk@gmail.com> wrote:
> >
> >
> > On Wed, Jan 11, 2012 at 10:59, Caius Howcroft <caius.howcroft@gmail.com>
> > wrote:
> >>
> >> Hi everyone
> >>
> >>
> >> I'm running 0.12 on a linux cluster here and generally it has been
> >> great. However, some users are complaining that their jobs crash
> >> periodically with messages like this:
> >> [IPClusterStart] [IPControllerApp] heartbeat::got bad heartbeat
> >> (possibly old?): 1326305300.61 (current=1326305302.613)
> >> [IPClusterStart] [IPControllerApp] heartbeat::heart
> >> '8f9428e4-543f-4218-b7b3-b32d57caa496' missed a beat, and took 1961.06
> >> ms to respond
> >> [IPClusterStart] [IPControllerApp] heartbeat::ignoring new heart:
> >> '8f9428e4-543f-4218-b7b3-b32d57caa496'
> >>
> >>
> >> Generally this brings everything to a halt. So my question is twofold:
> >> Firstly, I'm pretty sure my machines are synced up well, so I don't
> >> think it's anything I'm doing. Has anyone had this same problem?
> >
> >
> > I have never seen this myself, but I have heard of it from one other
> > user. It is very hard to debug, as it often takes many hours to
> > reproduce.  When this happens, an engine is treated as dead, even
> > though it actually isn't.  One option is to increase the heartbeat
> > timeout to something more like ten or thirty seconds
> > (HeartbeatMonitor.period = 10000).  Note that this directly affects
> > the amount of time it takes for the new engine registration process,
> > because an engine is not deemed to be ready until its heart has
> > started beating.
> >
> >
> >>
> >>
> >> Secondly, clearly at some point somewhere a machine is going to go
> >> belly up, and I want to make the processing robust to that. How do I
> >> configure LoadBalancedView to notice that a machine has gone bad and
> >> reassign jobs that were assigned to that machine? I notice that when
> >> we launch a set of jobs, they all get assigned to engines immediately.
> >> I think I can change this by setting c.TaskScheduler.hwm=1
> >
> >
> > Yes, hwm=1 prevents any greedy scheduling of tasks, at the cost of no
> > longer hiding network latency behind computation time and of increased
> > load on the scheduler.
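As a sketch, that setting goes in the controller-side config (assuming the
0.12-era config system):

```python
# ipcontroller_config.py -- sketch; with hwm=1 the scheduler hands each
# engine at most one unfinished task at a time, instead of pre-assigning
# the whole queue greedily.
c = get_config()
c.TaskScheduler.hwm = 1
```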
> >
> >>
> >> , but
> >> how do I tell the task manager to reconfigure the load if a node goes
> >> bad or a job fails.
> >
> >
> > Tasks can be resubmitted (explicitly requested from the client) or
> > retried (handled inside the scheduler).
> > If the most likely cause of a task failure is not a bug in the task
> > itself, then you can set `retries=1`, or even 5 to cover the unlikely
> > event that it lands on several engines that go down.  retries should
> > not be greater than the number of engines you have, because a task
> > will *not* be resubmitted to an engine where it failed.
> >
> > You can set the default value for retries submitted with a given view:
> >
> > lbview.retries = 5
> >
> > Or you can set it for a single block with temp_flags:
> >
> > with lbview.temp_flags(retries=5):
> >     lbview.apply(retrying_task)
> >
> > One should be careful with this, because you can actually bring down your
> > entire cluster by retrying
> > a segfaulting task too many times:
> >
> > def segfault():
> >     """This will segfault the calling process on Linux; equivalent
> >     calls can be made on Windows or Mac."""
> >     from ctypes import CDLL
> >     libc = CDLL("libc.so.6")
> >     libc.time(-1)  # BOOM!!
> >
> > with lbview.temp_flags(retries=len(lbview.client.ids)):
> >     lbview.apply(segfault)
> >
> > -MinRK
> >
> >
> >>
> >>
> >> Cheers
> >>
> >> Caius
> >> _______________________________________________
> >> IPython-User mailing list
> >> IPython-User@scipy.org
> >> http://mail.scipy.org/mailman/listinfo/ipython-user
> >
> >
>