[IPython-User] engines reconnect after controller restart?
Wed Jan 9 17:26:33 CST 2013
Thanks for this. It's a good step toward system-wide fault tolerance.
For now, verifying job completion and resubmitting unfinished items can
be managed at the application layer. Perhaps down the road, the internal
scheduler can be made persistent using, for example, MongoDB, so that any
unacknowledged work can resume gracefully.
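Here is a minimal sketch of what handling this at the application layer
could look like, assuming each work item has a stable id and finished ids
are recorded somewhere durable. The names `find_unfinished`,
`resubmit_unfinished` and the `submit` callable are hypothetical and not
part of IPython:

```python
def find_unfinished(submitted, completed_ids):
    """Return the submitted items whose ids have no recorded result.

    submitted     -- dict mapping item id -> payload (hypothetical bookkeeping)
    completed_ids -- iterable of ids known to have finished
    """
    done = set(completed_ids)
    return {item_id: payload
            for item_id, payload in submitted.items()
            if item_id not in done}


def resubmit_unfinished(submitted, completed_ids, submit):
    """Call submit(item_id, payload) for every unfinished item.

    `submit` is a placeholder for whatever the application uses to send
    work to the cluster. Returns the resubmitted ids in sorted order.
    """
    unfinished = find_unfinished(submitted, completed_ids)
    for item_id in sorted(unfinished):
        submit(item_id, unfinished[item_id])
    return sorted(unfinished)
```

An item may have finished without its result being recorded before the
controller went down, so this only works safely if the work items are
idempotent.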
On 01/09/2013 05:05 PM, MinRK wrote:
> On Wed, Jan 9, 2013 at 10:36 AM, Min RK <firstname.lastname@example.org> wrote:
> Yes, there is a preliminary implementation of this in current master.
> Sorry, I should have probably mentioned exactly how you would do this :)
> There are a few changes.
> 1. if you do not set `reuse_files=True`, the controller will clean up
> its connection files on a clean exit. That means that if you wish to
> stop and restart the controller cleanly, you need to set this (a crash
> will generally prevent cleanup).
> 2. engines now have their own heartbeat mechanism, so if the
> controller is down for too long, they will give up on their own.
> The logic here is a maximum number of missed heartbeats,
> so the timeout for engines is EngineFactory.max_heartbeat_misses *
> HeartMonitor.period (default = 50 * 3 s = 150 s, i.e. 2.5 minutes).
> You may want to change these two config values if that's not an
> appropriate time for engines to give up.
> 3. to attempt to restore the controller state, run:
>     ipcontroller --restore
> I doubt that this has been tested out in the world, but I have played
> with stopping and starting the Controller myself. Note that, at
> present, this only re-establishes connections; it does not restore job
> queues or anything else, so it is of limited utility.
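To make the timeout in item 2 concrete, here is a small sketch of the
arithmetic. It assumes HeartMonitor.period is given in milliseconds
(3000 for the 3-second default quoted above); the helper names are
hypothetical and only illustrate how the two config values combine:

```python
import math


def engine_timeout_seconds(max_heartbeat_misses=50, period_ms=3000):
    """Seconds an engine waits before giving up on the controller.

    Mirrors EngineFactory.max_heartbeat_misses * HeartMonitor.period,
    with the period assumed to be in milliseconds.
    """
    return max_heartbeat_misses * period_ms / 1000.0


def misses_for_timeout(timeout_s, period_ms=3000):
    """Smallest max_heartbeat_misses that gives at least timeout_s seconds."""
    return int(math.ceil(timeout_s * 1000.0 / period_ms))


# Defaults from the message above: 50 misses * 3 s = 150 s.
default_timeout = engine_timeout_seconds()

# To let engines survive a controller outage of up to 10 minutes,
# max_heartbeat_misses would need to be at least this value.
misses_for_ten_minutes = misses_for_timeout(600)
```

So to ride out a longer controller restart, raise
EngineFactory.max_heartbeat_misses on the engines rather than stretching
the heartbeat period.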
> On Jan 9, 2013, at 5:50, "Darren Govoni" <firstname.lastname@example.org> wrote:
>> A good while ago I was asking if IPython could re-form its
>> network after a controller restart and Min was gracious enough to
>> make a patch prototype to persist the controller state towards
>> this end.
>> Does the current (or next) release of IPython support
>> controller faults/restarts like this and re-establish engine
>> connections on restart?
>> IPython-User mailing list
>> IPython-User@scipy.org <mailto:IPython-User@scipy.org>