[IPython-User] Making an ipython cluster more stable

Caius Howcroft caius.howcroft@gmail....
Fri Feb 3 15:18:53 CST 2012


Hi MinRK

(sorry reopening an old thread)

I saw on GitHub that you guys made some changes to ZMQStreams
( https://github.com/ipython/ipython/issues/1304 ) to try to fix these
heartbeat issues, so I thought "great" and installed the latest from
the GitHub master branch.

However, I'm still seeing the exact same heartbeat warnings, followed
by the whole cluster coming down. I tried upping Heartbeat.period to
10000, but that caused the cluster to never get going... see the
printout below.
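
For reference, this is roughly what I put in ipcontroller_config.py in
the profile (I'm not 100% sure the class name is right; HeartMonitor is
my best guess):

# in ~/.ipython/profile_default/ipcontroller_config.py
c = get_config()
c.HeartMonitor.period = 10000  # heartbeat period in ms (guessing at the class name)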


2012-02-03 21:10:59.565 [IPClusterStart] Starting SSHEngineLauncher:
['ssh', '-tt', '-o', 'StrictHostKeyChecking no',
u'ip-10-80-186-44.ec2.internal', '/usr/bin/python',
u'/usr/local/lib/python2.7/dist-packages/ipython-0.12-py2.7.egg/IPython/parallel/apps/ipengineapp.py',
'--profile-dir', u'/home/chowcroft/.ipython/profile_default',
'--log-level=20']
2012-02-03 21:10:59.639 [IPClusterStart] Process 'ssh' started: 6097
2012-02-03 21:10:59.641 [IPClusterStart] Process 'engine set' started:
[None, None, None, None, None, None, None, None, None, None, None,
None, None, None, None, None, None, None, None, None, None]
2012-02-03 21:10:59.645 [IPClusterStart] Start the IPython controller
for parallel computing.
2012-02-03 21:10:59.646 [IPClusterStart] Connection to
ip-10-114-41-115.ec2.internal closed.
2012-02-03 21:10:59.646 [IPClusterStart] Process 'ssh' stopped:
{'pid': 6042, 'exit_code': 1}
2012-02-03 21:10:59.646 [IPClusterStart] IPython cluster: stopping


Not quite sure what to do... is there something else that's timing out?

Many thanks

Caius


On Wed, Jan 11, 2012 at 5:54 PM, MinRK <benjaminrk@gmail.com> wrote:
>
>
> On Wed, Jan 11, 2012 at 10:59, Caius Howcroft <caius.howcroft@gmail.com>
> wrote:
>>
>> Hi everyone
>>
>>
>> I'm running 0.12 on a linux cluster here and generally it has been
>> great. However, some users are complaining that their jobs crash
>> periodically with messages like this:
>> [IPClusterStart] [IPControllerApp] heartbeat::got bad heartbeat
>> (possibly old?): 1326305300.61 (current=1326305302.613)
>> [IPClusterStart] [IPControllerApp] heartbeat::heart
>> '8f9428e4-543f-4218-b7b3-b32d57caa496' missed a beat, and took 1961.06
>> ms to respond
>> [IPClusterStart] [IPControllerApp] heartbeat::ignoring new heart:
>> '8f9428e4-543f-4218-b7b3-b32d57caa496'
>>
>>
>> Generally this brings everything to a halt. So my question is twofold:
>> Firstly, I'm pretty sure my machines are synced up well, so I don't
>> think it's anything I'm doing. Has anyone had this same problem?
>
>
> I have never seen this myself, but I have heard of it from one other
> user. It is very hard to debug, as it often takes many hours to reproduce.
>  When this happens, an engine is treated as dead, even though it actually
> isn't.  One option is to increase the heartbeat timeout to something more
> like ten or thirty seconds (HeartbeatMonitor.period = 10000).  Note that
> this directly affects the amount of time it takes for the new engine
> registration process, because an engine is not deemed to be ready until its
> heart has started beating.
>
>
>>
>>
>> Secondly, clearly at some point a machine somewhere is going to go
>> belly up, and I want to make the processing robust to that. How do I
>> configure LoadBalancedView to notice that a machine has gone bad and
>> reassign jobs that were assigned to that machine? I notice that when
>> we launch a set of jobs, they all get assigned to engines immediately
>> (I think I can change this by setting c.TaskScheduler.hwm = 1)
>
>
> Yes, hwm=1 prevents any greedy scheduling of tasks, at the expense of
> no longer hiding network latency behind computation time, and it
> increases load on the scheduler.
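>
> For example (a sketch, assuming the controller is configured through
> ipcontroller_config.py in your profile):
>
> # in <profile>/ipcontroller_config.py
> c.TaskScheduler.hwm = 1  # never assign more than one outstanding task per engine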
>
>>
>> , but
>> how do I tell the task manager to reconfigure the load if a node goes
>> bad or a job fails.
>
>
> Tasks can be resubmitted (explicitly requested from the client) or
> retried (handled inside the scheduler). If the most likely cause of a
> task failure is not a bug in the task itself, then you can set
> `retries=1`, or even 5 for the unlikely event that it gets run on
> multiple engines that go down. retries should not be greater than the
> number of engines you have, because a task will *not* be resubmitted
> to an engine where it failed.
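>
> Resubmission, on the other hand, is something you request from the
> client explicitly. A rough sketch (names like `my_task` are just
> placeholders; `rc` is the Client, `lbview` the LoadBalancedView):
>
> ar = lbview.apply_async(my_task)
> try:
>     result = ar.get()
> except Exception:
>     # ask the Hub to re-run the same task; the scheduler picks an engine
>     ar = rc.resubmit(ar.msg_ids)
>     result = ar.get()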
>
> You can set the default value of retries for tasks submitted with a given view:
>
> lbview.retries = 5
>
> Or you can set it for a single block with temp_flags:
>
> with lbview.temp_flags(retries=5):
>     lbview.apply(retrying_task)
>
> One should be careful with this, because you can actually bring down your
> entire cluster by retrying
> a segfaulting task too many times:
>
> def segfault():
>     """This will crash a linux system; equivalent calls can be made on
> Windows or Mac"""
>     from ctypes import CDLL
>     libc = CDLL("libc.so.6")
>     libc.time(-1)  # BOOM!!
>
> with lbview.temp_flags(retries=len(lbview.client.ids)):
>     lbview.apply(segfault)
>
> -MinRK
>
>
>>
>>
>> Cheers
>>
>> Caius
>> _______________________________________________
>> IPython-User mailing list
>> IPython-User@scipy.org
>> http://mail.scipy.org/mailman/listinfo/ipython-user
>
>

