[IPython-User] Making an ipython cluster more stable

Caius Howcroft caius.howcroft@gmail....
Wed Jan 11 12:59:19 CST 2012


Hi everyone


I'm running IPython 0.12 on a Linux cluster here and generally it has
been great. However, some users are complaining that their jobs crash
periodically with messages like this:
[IPClusterStart] [IPControllerApp] heartbeat::got bad heartbeat (possibly old?): 1326305300.61 (current=1326305302.613)
[IPClusterStart] [IPControllerApp] heartbeat::heart '8f9428e4-543f-4218-b7b3-b32d57caa496' missed a beat, and took 1961.06 ms to respond
[IPClusterStart] [IPControllerApp] heartbeat::ignoring new heart: '8f9428e4-543f-4218-b7b3-b32d57caa496'

Generally this brings everything to a halt. So my question is twofold:
Firstly, I'm pretty sure my machines are synced up well, so I don't
think it's anything I'm doing. Has anyone had this same problem?

Secondly, clearly at some point a machine somewhere is going to go belly
up, and I want to make the processing robust to that. How do I
configure LoadBalancedView to notice that a machine has gone bad and
reassign jobs that were assigned to that machine? I notice that when
we launch a set of jobs, they all get assigned to engines immediately.
I think I can change this by setting c.TaskScheduler.hwm=1, but
how do I tell the task manager to redistribute the load if a node goes
bad or a job fails?
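
To make the behaviour I am after concrete, here is a minimal sketch in
plain Python, independent of IPython. Tasks are handed out one at a
time (roughly what hwm=1 would give), and a task whose engine dies is
put back in the queue for another engine. Every name in it
(run_with_reassignment, EngineDied, the engines dict, good_engine,
dead_engine) is hypothetical and made up for illustration; none of it
is IPython.parallel API, and it is not meant to describe how
TaskScheduler works internally.

```python
from collections import deque


class EngineDied(Exception):
    """Raised by a (hypothetical) engine callable when its machine is gone."""


def run_with_reassignment(tasks, engines, max_retries=2):
    """Run every task on some engine, reassigning tasks when an engine dies.

    tasks   -- list of task arguments, one per task
    engines -- dict mapping engine id -> callable(arg); the callable returns
               a result or raises EngineDied
    Tasks are handed out one at a time instead of all being assigned up
    front, so a dead engine never sits on a backlog of queued work.
    """
    pending = deque((index, arg, 0) for index, arg in enumerate(tasks))
    alive = deque(engines)  # engine ids, used round robin
    results = {}
    while pending:
        if not alive:
            raise RuntimeError("no engines left to run the remaining tasks")
        index, arg, attempts = pending.popleft()
        engine_id = alive[0]
        alive.rotate(-1)  # the next task goes to the next engine
        try:
            results[index] = engines[engine_id](arg)
        except EngineDied:
            alive.remove(engine_id)  # stop scheduling onto the dead engine
            if attempts >= max_retries:
                raise
            pending.append((index, arg, attempts + 1))  # requeue elsewhere
    return [results[index] for index in range(len(tasks))]


def good_engine(x):
    """Stand-in for a healthy engine."""
    return x * x


def dead_engine(x):
    """Stand-in for an engine whose node has gone down."""
    raise EngineDied("node down")


squares = run_with_reassignment([1, 2, 3], {"a": good_engine, "b": dead_engine})
```

Here the task first given to engine "b" is requeued and finishes on
engine "a", so the caller still gets a result for every task, in
submission order. I believe LoadBalancedView also has a retries flag
aimed at this, but that is an assumption on my part and would need
checking against the documentation for 0.12.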

Cheers

Caius

