[IPython-User] Making an ipython cluster more stable
Wed Jan 11 16:54:36 CST 2012
On Wed, Jan 11, 2012 at 10:59, Caius Howcroft <firstname.lastname@example.org>wrote:
> Hi everyone
> I'm running 0.12 on a linux cluster here and generally it has been
> great. However, some users are complaining that their jobs crash
> periodically with messages like this:
> [IPClusterStart] [IPControllerApp] heartbeat::got bad heartbeat
> (possibly old?): 1326305300.61 (current=1326305302.613)
> [IPClusterStart] [IPControllerApp] heartbeat::heart
> '8f9428e4-543f-4218-b7b3-b32d57caa496' missed a beat, and took 1961.06
> ms to respond
> [IPClusterStart] [IPControllerApp] heartbeat::ignoring new heart:
> Generally this brings everything to a halt. So my question is twofold:
> Firstly, I'm pretty sure my machines are synced up well, so I don't
> think it's anything I'm doing. Has anyone had this same problem?
I have never seen this myself, but I have heard of it from one other
user. It is very hard to debug, as it often takes many hours to reproduce.
When this happens, an engine is treated as dead, even though it actually
isn't. One option is to increase the heartbeat timeout to something more
like ten or thirty seconds (HeartbeatMonitor.period = 10000). Note that
this directly affects the amount of time it takes for the new engine
registration process, because an engine is not deemed to be ready until its
heart has started beating.
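As a sketch of what that would look like in your controller config file (the period is in milliseconds, so 10000 is ten seconds):

```python
# in ipcontroller_config.py
c = get_config()

# allow engines up to 10 seconds to respond to a heartbeat
# before they are treated as dead
c.HeartbeatMonitor.period = 10000
```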
> Secondly, clearly at some point somewhere a machine is going to go belly
> up, and I want to make the processing robust to that. How do I
> configure LoadBalancedView to notice that a machine has gone bad and
> reassign jobs that were assigned to that machine? I notice that when
> we launch a set of jobs, they all get assigned to engines immediately,
> I think I can change this by setting c.TaskScheduler.hwm = 1
Yes, hwm=1 prevents any greedy scheduling of tasks, at the expense of no
longer hiding network latency behind computation time, and of increased
load on the scheduler.
> But how do I tell the task manager to reconfigure the load if a node
> goes bad or a job fails?
Tasks can be resubmitted (explicitly requested from the client) or retried
(handled inside the scheduler).
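A sketch of explicit resubmission from the client side; `my_task` and its argument are placeholders, and this assumes a running cluster you can connect to:

```python
from IPython.parallel import Client

rc = Client()
lbview = rc.load_balanced_view()

ar = lbview.apply_async(my_task, some_arg)
try:
    result = ar.get()
except Exception:
    # resubmit the failed task(s) by msg_id; this returns
    # a new AsyncResult for the resubmitted work
    ar2 = rc.resubmit(ar.msg_ids)
    result = ar2.get()
```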
If the most likely cause of a task failure is not a bug in the task
itself, then you can set `retries=1`, or even 5, to cover the event that
it lands on multiple engines that go down. `retries` should not be
greater than the number of engines you have, because a task will *not* be
resubmitted to an engine where it failed.
You can set the default number of retries for tasks submitted with a given view:
lbview.retries = 5
Or you can set it for a single block with temp_flags:
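A minimal sketch, assuming an existing LoadBalancedView `lbview`; the task function `wonky_task` is a placeholder for your own code:

```python
def wonky_task(x):
    # stand-in for a task that may land on a flaky engine
    return x * 2

# retries=5 applies only to tasks submitted inside this block;
# lbview's own retries setting is restored on exit
with lbview.temp_flags(retries=5):
    ar = lbview.apply_async(wonky_task, 21)
```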
One should be careful with this, because you can actually bring down your
entire cluster by retrying
a segfaulting task too many times:
"""This will crash a linux system; equivalent calls can be made on
Windows or Mac"""
from ctypes import CDLL
libc = CDLL("libc.so.6")
libc.time(-1) # BOOM!!