[IPython-User] engines reconnect after controller restart?
Wed Jan 9 17:33:22 CST 2013
On Jan 9, 2013, at 15:26, Darren Govoni <email@example.com> wrote:
> Thanks for this. It's a good step toward system-wide fault tolerance. For now, verifying job completion and resubmitting unfinished items can be managed at the application layer. Perhaps down the road, the internal scheduler can be made persistent using, for example, mongo, such that any un-acked work can resume gracefully.
indeed - the Hub already uses mongo to persist tasks, so all the information is there.
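
The application-layer approach mentioned above (verify completion, resubmit what is unfinished) could look roughly like the sketch below. It is deliberately library-free: `tasks`, `is_done`, and `submit` are hypothetical names for bookkeeping and callbacks that the application itself would supply, not part of IPython's API.

```python
def resubmit_unfinished(tasks, is_done, submit):
    """Resubmit every task whose completion cannot be verified.

    tasks   : dict mapping a task id to its payload (hypothetical
              bookkeeping kept by the application, not by IPython)
    is_done : callable(task_id) -> bool, True once a result is confirmed
    submit  : callable(payload) -> new task id
    Returns a dict of the resubmitted tasks, keyed by their new ids.
    """
    pending = {}
    for task_id, payload in tasks.items():
        if is_done(task_id):
            continue  # result verified, nothing to resubmit
        new_id = submit(payload)  # hand the same payload back to the cluster
        pending[new_id] = payload
    return pending
```

After a controller restart, the application would call this once with its own record of outstanding work, then keep tracking the returned dict in place of the old one.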
> On 01/09/2013 05:05 PM, MinRK wrote:
>> On Wed, Jan 9, 2013 at 10:36 AM, Min RK <firstname.lastname@example.org> wrote:
>>> Yes, there is a preliminary implementation of this in current master.
>> Sorry, I should have probably mentioned exactly how you would do this :)
>> There are a few changes.
>> 1. if you do not set `reuse_files=True`, the controller will clean up its connection files on a clean exit. That means that if you wish to stop and restart the controller cleanly, you need to set this (a crash will generally prevent cleanup).
>> 2. engines now have their own heartbeat mechanism, so if the controller is down for too long, they will give up on their own.
>> The logic here is a maximum number of missed heartbeats, so the timeout for engines is EngineFactory.max_heartbeat_misses * HeartMonitor.period (default = 50 * 3 s = 150 s, about 2.5 minutes). You may want to change these two config values if that's not an appropriate time for engines to give up.
>> 3. to attempt to restore the controller state, do
>> ipcontroller --restore
>> I doubt that this has been tested out in the world, but I have played with stopping and starting the Controller myself. Note that, at present, this only re-establishes connections; it does not restore job queues or anything else, so it is of limited utility.
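
The engine timeout described in item 2 is just a product of the two config values, so it is easy to check before changing either one. The sketch below assumes `HeartMonitor.period` is expressed in milliseconds (so the default of 3 seconds is 3000); the function names are illustrative and not part of IPython.

```python
import math


def engine_give_up_seconds(max_heartbeat_misses=50, period_ms=3000):
    """Seconds an engine waits before giving up on a silent controller.

    Mirrors EngineFactory.max_heartbeat_misses * HeartMonitor.period,
    assuming the period is given in milliseconds.
    """
    return max_heartbeat_misses * period_ms / 1000.0


def misses_for_timeout(timeout_seconds, period_ms=3000):
    """Smallest max_heartbeat_misses that tolerates the given outage."""
    return int(math.ceil(timeout_seconds * 1000.0 / period_ms))


if __name__ == "__main__":
    print(engine_give_up_seconds())   # defaults from the message above
    print(misses_for_timeout(600))    # misses needed to survive 10 minutes
```

If the controller is expected to be down for longer than the default allows during a restart, raising `max_heartbeat_misses` to the value from `misses_for_timeout` is one way to keep engines alive until it comes back.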
>>> On Jan 9, 2013, at 5:50, "Darren Govoni" <email@example.com> wrote:
>>>> A good while ago I was asking if IPython could re-form its network after a controller restart, and Min was gracious enough to make a patch prototype to persist the controller state toward this end.
>>>> Does the current (or next) release of IPython support controller faults/restarts like this and re-establish engine connections on restart?
>>>> IPython-User mailing list