[IPython-User] ipcontroller failover?

MinRK benjaminrk@gmail....
Tue Mar 6 15:49:54 CST 2012


Can I ask more about what your environment is like, and the typical
circumstances of controller shutdown / crash?

How often does the controller die, how many tasks are pending in the
Schedulers, and how many are active on engines when this happens?  What are
your expectations/hopes/dreams for behavior if the controller goes down
while a bunch of work is in-flight?

-MinRK

On Tue, Mar 6, 2012 at 13:20, <darren@ontrenet.com> wrote:

> Wow. Awesome. Let me try it. Many thanks.
>
> > You might check out this first-go implementation:
> >
> > https://github.com/ipython/ipython/pull/1471
> >
> > It seems to work fine if the cluster was idle at controller crash, but I
> > haven't tested the behavior of running jobs.  I'm certain that the
> > propagation of results of jobs submitted before shutdown all the way up
> > to interactive Clients is broken, but the results should still arrive in
> > the Hub's db.
> >
> > -MinRK
> >
> >
> > On Mon, Mar 5, 2012 at 16:38, MinRK <benjaminrk@gmail.com> wrote:
> >
> >> Correct, engines do not reconnect to a new controller, and right now a
> >> Controller is a single point of failure.
> >>
> >> We absolutely do intend to enable restarting the controller, and it
> >> wouldn't be remotely difficult; the code just isn't written yet.
> >>
> >> Steps required for this:
> >>
> >> 1. persist engine connection state to files/db (the engine ID/UUID
> >> mapping should suffice)
> >> 2. when starting up, load this information into the Hub, instead of
> >> starting from scratch
> >>
> >> That is all.  No change should be required in the engines or clients, as
> >> zeromq handles the reconnect automagically.
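> >>
> >> As a rough sketch of step 1 (helper names hypothetical, not the actual
> >> IPython API; the real Hub would persist richer records than this):
> >>
> >>     import json
> >>
> >>     def save_engine_state(engines, path):
> >>         # persist the engine id -> UUID mapping so a restarted Hub
> >>         # can recognize the engines that are still connected
> >>         with open(path, 'w') as f:
> >>             json.dump(dict((str(k), v) for k, v in engines.items()), f)
> >>
> >>     def load_engine_state(path):
> >>         # load the mapping at startup instead of starting from scratch
> >>         with open(path) as f:
> >>             return dict((int(k), v) for k, v in json.load(f).items())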
> >>
> >> There is already enough information stored in the *task* database to
> >> resume all tasks that were waiting in the Scheduler, but I'm not sure
> >> whether this should be done by default, or only on request.
> >>
> >> -MinRK
> >>
> >> On Mon, Mar 5, 2012 at 15:17, Darren Govoni <darren@ontrenet.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> On Sun, 2012-02-12 at 13:19 -0800, MinRK wrote:
> >>> > It may also be unnecessary, because if the controller comes up at the
> >>> > same endpoint(s), then zeromq handles all the reconnects invisibly.  A
> >>> > connection to an endpoint is always valid, whether or not there is a
> >>> > socket present at any given point in time.
> >>>
> >>>   I tried an example to see this. I ran an ipcontroller on one machine
> >>> with a static --port=21001 so the engine/client connection files would
> >>> always be valid.
> >>>
> >>
> >> Just specifying the registration port isn't enough information, and you
> >> should be using `--reuse` or `IPControllerApp.reuse_files=True` for
> >> connection files to remain valid across sessions.
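> >>
> >> That is, either pass it on the command line:
> >>
> >>     $ ipcontroller --reuse --port=21001
> >>
> >> or set it in ipcontroller_config.py for your profile:
> >>
> >>     c = get_config()
> >>     c.IPControllerApp.reuse_files = True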
> >>
> >>
> >>>
> >>> I connected one engine from another server.
> >>>
> >>> I killed the controller and restarted it.
> >>>
> >>> After doing:
> >>>
> >>> from IPython.parallel import Client
> >>>
> >>> client = Client()
> >>> client.ids
> >>> []
> >>>
> >>> There are no longer any engines connected.
> >>>
> >>> dview = client[:]
> >>> ...
> >>> NoEnginesRegistered: Can't build targets without any engines
> >>>
> >>> The problem is that for any large-scale system, say 1 controller with
> >>> 50 engines running on 50 servers, this single point of failure is hard
> >>> to remedy.
> >>>
> >>> Is there a way to tell the controller to reconnect to the last known
> >>> engine IP addresses? Or some other way to re-establish the grid?
> >>> Rebooting 50 servers is not a good option for us.
> >>>
> >>> On Sun, 2012-02-12 at 13:19 -0800, MinRK wrote:
> >>> >
> >>> >
> >>> > On Sun, Feb 12, 2012 at 13:02, Darren Govoni <darren@ontrenet.com>
> >>> > wrote:
> >>> >         Correct me if I'm wrong, but do the ipengines 'connect' or
> >>> >         otherwise announce their presence to the controller?
> >>> >
> >>> >
> >>> > Yes, 100% of the connections are inbound to the controller processes,
> >>> > from clients and engines alike.  This is a strict requirement, because
> >>> > it would not be acceptable for engines to need open ports for inbound
> >>> > connections.  Simply bringing up a new controller with the same
> >>> > connection information would result in the cluster continuing to
> >>> > function, with the engines and client never realizing the controller
> >>> > went down at all, nor having to act on it in any way.
> >>> >
> >>> >         If it were the other way around, then this would accommodate
> >>> >         some degree of fault tolerance for the controller, because it
> >>> >         could be restarted by a watchdog and then re-establish the
> >>> >         connected state of the cluster. i.e. a controller comes
> >>> >         online, a pub/sub message is sent to a known channel, and
> >>> >         clients or engines add the new ipcontroller to their internal
> >>> >         list as a failover endpoint.
> >>> >
> >>> >
> >>> > This is still possible without reversing connection direction.  Note
> >>> > that in zeromq there is *exactly zero* correlation between
> >>> > communication direction and connection direction.  PUB can connect to
> >>> > SUB, and vice versa.  In fact a single socket can bind and connect at
> >>> > the same time.
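> >>> >
> >>> > For instance, in plain pyzmq (illustrative only, not the actual
> >>> > controller sockets):
> >>> >
> >>> >     import zmq
> >>> >
> >>> >     ctx = zmq.Context()
> >>> >
> >>> >     # the SUB side binds...
> >>> >     sub = ctx.socket(zmq.SUB)
> >>> >     sub.setsockopt(zmq.SUBSCRIBE, b'')
> >>> >     sub.bind('tcp://*:5555')
> >>> >
> >>> >     # ...and the PUB side connects
> >>> >     pub = ctx.socket(zmq.PUB)
> >>> >     pub.connect('tcp://localhost:5555')
> >>> >
> >>> >     # and a single socket can bind and connect at the same time
> >>> >     both = ctx.socket(zmq.PUB)
> >>> >     both.bind('tcp://*:5556')
> >>> >     both.connect('tcp://localhost:5557')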
> >>> >
> >>> >
> >>> > It may also be unnecessary, because if the controller comes up at the
> >>> > same endpoint(s), then zeromq handles all the reconnects invisibly.  A
> >>> > connection to an endpoint is always valid, whether or not there is a
> >>> > socket present at any given point in time.
> >>> >
> >>> >
> >>> >         On Sun, 2012-02-12 at 12:06 -0800, MinRK wrote:
> >>> >         >
> >>> >         >
> >>> >         > On Sun, Feb 12, 2012 at 11:48, Darren Govoni
> >>> >         > <darren@ontrenet.com> wrote:
> >>> >         >         On Sun, 2012-02-12 at 11:12 -0800, MinRK wrote:
> >>> >         >         >
> >>> >         >         >
> >>> >         >         > On Sun, Feb 12, 2012 at 10:42, Darren Govoni
> >>> >         >         > <darren@ontrenet.com> wrote:
> >>> >         >         >         Thanks Min,
> >>> >         >         >
> >>> >         >         >         Is it possible to open a ticket for this
> >>> >         >         >         capability for a (near) future release?
> >>> >         >         >         It complements the already amazing
> >>> >         >         >         load-balancing capability.
> >>> >         >         >
> >>> >         >         >
> >>> >         >         > You are welcome to open an Issue.  I don't know
> >>> >         >         > if it will make it into one of the next few
> >>> >         >         > releases, but it is on my todo list.  The best
> >>> >         >         > way to get this sort of thing going is to start
> >>> >         >         > with a Pull Request.
> >>> >         >
> >>> >         >
> >>> >         >         Ok, I will open an issue. Thanks. In the meantime,
> >>> >         >         is it possible for clients to 'know' when a
> >>> >         >         controller is no longer available? For example, it
> >>> >         >         would be nice if I could insert a callback handler
> >>> >         >         for this sort of internal exception, so I can
> >>> >         >         provide some graceful recovery options.
> >>> >         >
> >>> >         >
> >>> >         > It would be sensible to add a heartbeat mechanism on the
> >>> >         > controller->client PUB channel for this information.  Until
> >>> >         > then, your main controller crash detection is going to be
> >>> >         > simple timeouts.
> >>> >         >
> >>> >         >
> >>> >         > ZeroMQ makes disconnect detection a challenge: there are no
> >>> >         > disconnect events, because a disconnected channel is still
> >>> >         > valid, as the peer is allowed to just come back up.
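> >>> >         >
> >>> >         > Until such a heartbeat exists, a crude client-side liveness
> >>> >         > check could look something like this sketch (the 5-second
> >>> >         > timeout is an arbitrary choice, not anything built in):
> >>> >         >
> >>> >         >     from IPython.parallel import Client
> >>> >         >
> >>> >         >     rc = Client()
> >>> >         >     # submit a trivial task and see if anything comes back
> >>> >         >     ar = rc[:].apply_async(lambda: 'ping')
> >>> >         >     ar.wait(timeout=5)
> >>> >         >     if not ar.ready():
> >>> >         >         # no reply within the timeout -- the controller
> >>> >         >         # (or every engine) may be down
> >>> >         >         print("controller may be down; trying recovery")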
> >>> >         >
> >>> >         >
> >>> >         >         >
> >>> >         >         >
> >>> >         >         >         Perhaps a related but separate notion
> >>> >         >         >         would be the ability to have clustered
> >>> >         >         >         controllers for HA.
> >>> >         >         >
> >>> >         >         >
> >>> >         >         > I do have a model in mind for this sort of
> >>> >         >         > thing, though not multiple *controllers*,
> >>> >         >         > rather multiple Schedulers.  Our design with
> >>> >         >         > 0MQ would make this pretty simple (just start
> >>> >         >         > another scheduler; an extra call to
> >>> >         >         > socket.connect() on the Client and Engine is
> >>> >         >         > all that's needed), and this should allow
> >>> >         >         > scaling to tens of thousands of engines.
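> >>> >         >         >
> >>> >         >         > In raw pyzmq terms the extra call is just one
> >>> >         >         > more connect() on the existing socket (socket
> >>> >         >         > type and endpoints here are illustrative, not
> >>> >         >         > the actual scheduler wiring):
> >>> >         >         >
> >>> >         >         >     import zmq
> >>> >         >         >
> >>> >         >         >     ctx = zmq.Context()
> >>> >         >         >     task = ctx.socket(zmq.DEALER)
> >>> >         >         >     task.connect('tcp://scheduler-a:10101')
> >>> >         >         >     # a second scheduler: zeromq round-robins
> >>> >         >         >     # outgoing requests across both peers
> >>> >         >         >     task.connect('tcp://scheduler-b:10101')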
> >>> >         >
> >>> >         >
> >>> >         >         Yes! That's what I'm after. In this cloud-scale
> >>> >         >         age of computing, that would be ideal.
> >>> >         >
> >>> >         >
> >>> >         >         Thanks Min.
> >>> >         >
> >>> >         >         >
> >>> >         >         >
> >>> >         >         >         On Sun, 2012-02-12 at 08:32 -0800,
> >>> >         >         >         Min RK wrote:
> >>> >         >         >         > No, there is no failover mechanism.
> >>> >         >         >         > When the controller goes down,
> >>> >         >         >         > further requests will simply hang.
> >>> >         >         >         > We have almost all the information
> >>> >         >         >         > we need to bring up a new controller
> >>> >         >         >         > in its place (restart it), in which
> >>> >         >         >         > case the Client wouldn't even need
> >>> >         >         >         > to know that it went down, and would
> >>> >         >         >         > continue to just work, thanks to
> >>> >         >         >         > some zeromq magic.
> >>> >         >         >         >
> >>> >         >         >         > -MinRK
> >>> >         >         >         >
> >>> >         >         >         > On Feb 12, 2012, at 5:02, Darren
> >>> >         >         >         > Govoni <darren@ontrenet.com> wrote:
> >>> >         >         >         >
> >>> >         >         >         > > Hi,
> >>> >         >         >         > > Does ipython support any kind of
> >>> >         >         >         > > clustering or failover for
> >>> >         >         >         > > ipcontrollers? I'm wondering how
> >>> >         >         >         > > situations are handled where a
> >>> >         >         >         > > controller goes down when a client
> >>> >         >         >         > > needs to perform something.
> >>> >         >         >         > >
> >>> >         >         >         > > thanks for any tips.
> >>> >         >         >         > > Darren