[IPython-User] ipcontroller failover?
MinRK
benjaminrk@gmail....
Tue Mar 6 15:49:54 CST 2012
Can I ask more about what your environment is like, and the typical
circumstances of controller shutdown / crash?
How often does the controller die, how many tasks are pending in the
Schedulers, and how many are active on engines when this happens? What are
your expectations/hopes/dreams for behavior if the controller goes down
while a bunch of work is in-flight?
-MinRK
On Tue, Mar 6, 2012 at 13:20, <darren@ontrenet.com> wrote:
> Wow. Awesome. Let me try it. Many thanks.
>
> > You might check out this first-go implementation:
> >
> > https://github.com/ipython/ipython/pull/1471
> >
> > It seems to work fine if the cluster was idle at controller crash, but I
> > haven't tested the behavior of running jobs. I'm certain that the
> > propagation of results of jobs submitted before shutdown all the way up
> > to interactive Clients is broken, but the results should still arrive in
> > the Hub's db.
> >
> > -MinRK
> >
> >
> > On Mon, Mar 5, 2012 at 16:38, MinRK <benjaminrk@gmail.com> wrote:
> >
> >> Correct, engines do not reconnect to a new controller, and right now a
> >> Controller is a single point of failure.
> >>
> >> We absolutely do intend to enable restarting the controller, and it
> >> wouldn't be remotely difficult, the code just isn't written yet.
> >>
> >> Steps required for this:
> >>
> >> 1. persist engine connection state to files/db (the engine ID/UUID
> >> mapping should suffice)
> >> 2. when starting up, load this information into the Hub, instead of
> >> starting from scratch
> >>
> >> That is all. No change should be required in the engines or clients, as
> >> zeromq handles the reconnect automagically.
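The two steps above can be sketched roughly as follows. This is only an illustration of the persist-then-reload idea, not the code in the pull request: the function names, the JSON file format, and the shape of the mapping (integer engine ID to UUID string) are all assumptions.

```python
import json
import os
import tempfile


def save_engine_state(path, engines):
    """Persist a {engine_id: uuid} mapping as JSON (hypothetical format).

    The data is written to a temporary file first and then renamed, so a
    crash in the middle of a write never leaves a truncated state file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        # JSON object keys must be strings, so engine IDs are stringified.
        json.dump({str(eid): uuid for eid, uuid in engines.items()}, f)
    os.replace(tmp_path, path)


def load_engine_state(path):
    """Return the saved mapping, or {} when there is nothing to resume."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return {int(eid): uuid for eid, uuid in json.load(f).items()}
```

A restarted Hub would call something like load_engine_state() once at startup and fall back to an empty registry when no file is found.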
> >>
> >> There is already enough information stored in the *task* database to
> >> resume all tasks that were waiting in the Scheduler, but I'm not sure
> >> whether this should be done by default, or only on request.
> >>
> >> -MinRK
> >>
> >> On Mon, Mar 5, 2012 at 15:17, Darren Govoni <darren@ontrenet.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> On Sun, 2012-02-12 at 13:19 -0800, MinRK wrote:
> >>> > It may also be unnecessary, because if the controller comes up at the
> >>> > same endpoint(s), then zeromq handles all the reconnects invisibly. A
> >>> > connection to an endpoint is always valid, whether or not there is a
> >>> > socket present at any given point in time.
> >>>
> >>> I tried an example to see this. I ran an ipcontroller on one machine
> >>> with static --port=21001 so engine client files would always be valid.
> >>>
> >>
> >> Just specifying the registration port isn't enough information, and you
> >> should be using `--reuse` or `IPControllerApp.reuse_files=True` for
> >> connection files to remain valid across sessions.
> >>
> >>
> >>>
> >>> I connected one engine from another server.
> >>>
> >>> I killed the controller and restarted it.
> >>>
> >>> After doing:
> >>>
> >>> client = Client()
> >>> client.ids
> >>> []
> >>>
> >>> There are no longer any engines connected.
> >>>
> >>> dview = client[:]
> >>> ...
> >>> NoEnginesRegistered: Can't build targets without any engines
> >>>
> >>> The problem perhaps is that for any large scale system, say 1 controller
> >>> with 50 engines running on 50 servers, this single-point-of-failure is
> >>> hard to remedy.
> >>>
> >>> Is there a way to tell the controller to reconnect to last known engine
> >>> IP addresses? Or some other way to re-establish the grid? Rebooting 50
> >>> servers is not a good option for us.
> >>>
> >>> On Sun, 2012-02-12 at 13:19 -0800, MinRK wrote:
> >>> >
> >>> >
> >>> > On Sun, Feb 12, 2012 at 13:02, Darren Govoni <darren@ontrenet.com>
> >>> > wrote:
> >>> > Correct me if I'm wrong, but do the ipengines 'connect' or otherwise
> >>> > announce their presence to the controller?
> >>> >
> >>> >
> >>> > Yes, 100% of the connections are inbound to the controller processes,
> >>> > from clients and engines alike. This is a strict requirement, because
> >>> > it would not be acceptable for engines to need open ports for inbound
> >>> > connections. Simply bringing up a new controller with the same
> >>> > connection information would result in the cluster continuing to
> >>> > function, with the engines and client never realizing the controller
> >>> > went down at all, nor having to act on it in any way.
> >>> >
> >>> > If it were the other way around, then this would accommodate some
> >>> > degree of fault tolerance for the controller, because it could be
> >>> > restarted by a watchdog and then re-establish the connected state of
> >>> > the cluster, i.e. a controller comes online, a pub/sub message is
> >>> > sent to a known channel, and clients or engines add the new
> >>> > ipcontroller to their internal list as a failover endpoint.
> >>> >
> >>> >
> >>> > This is still possible without reversing connection direction. Note
> >>> > that in zeromq there is *exactly zero* correlation between
> >>> > communication direction and connection direction. PUB can connect to
> >>> > SUB, and vice versa. In fact a single socket can bind and connect at
> >>> > the same time.
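A small pyzmq sketch of the point about connection direction, assuming pyzmq is installed. It uses PUSH/PULL rather than PUB/SUB only to keep delivery deterministic; the socket names and the loopback endpoints are made up for the example. Data flows from sender to receiver in both cases, although one sender connects and the other binds, and the single receiver socket both binds and connects.

```python
import zmq

ctx = zmq.Context()
pull = ctx.socket(zmq.PULL)      # the one receiver
push_a = ctx.socket(zmq.PUSH)    # a sender that connects
push_b = ctx.socket(zmq.PUSH)    # a sender that binds
# A bound PUSH blocks on send until a peer arrives; cap the wait at 5 s.
push_b.setsockopt(zmq.SNDTIMEO, 5000)

# Receiver binds, sender A connects in.
port_a = pull.bind_to_random_port("tcp://127.0.0.1")
push_a.connect("tcp://127.0.0.1:%i" % port_a)

# Sender B binds, and the same receiver socket also connects out to it.
port_b = push_b.bind_to_random_port("tcp://127.0.0.1")
pull.connect("tcp://127.0.0.1:%i" % port_b)

push_a.send(b"from A")
push_b.send(b"from B")

# Collect both messages, giving up if nothing arrives for 5 seconds.
received = set()
while len(received) < 2 and pull.poll(5000):
    received.add(pull.recv())

for s in (pull, push_a, push_b):
    s.close(linger=0)
ctx.term()
```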
> >>> >
> >>> >
> >>> > It may also be unnecessary, because if the controller comes up at the
> >>> > same endpoint(s), then zeromq handles all the reconnects invisibly. A
> >>> > connection to an endpoint is always valid, whether or not there is a
> >>> > socket present at any given point in time.
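The claim that a connection to an endpoint is valid before any socket exists there can be tried with a few lines of pyzmq (assumed installed). The "engine" and "controller" labels are only analogies for this sketch, and free_tcp_port() is a helper invented here to pick a loopback port.

```python
import socket

import zmq


def free_tcp_port():
    """Ask the OS for a currently unused loopback port (small race window)."""
    s = socket.socket()
    s.bind(("127.0.0.1", 0))
    port = s.getsockname()[1]
    s.close()
    return port


endpoint = "tcp://127.0.0.1:%i" % free_tcp_port()
ctx = zmq.Context()

# The "engine" side connects while nothing is listening at the endpoint.
push = ctx.socket(zmq.PUSH)
push.connect(endpoint)
push.send(b"hello")  # queued locally; there is no peer yet

# The "controller" side comes up later at the same endpoint.
pull = ctx.socket(zmq.PULL)
pull.bind(endpoint)

# zeromq keeps retrying the connection in the background, so the queued
# message is delivered once the bind has happened.
message = pull.recv() if pull.poll(5000) else None

push.close(linger=0)
pull.close(linger=0)
ctx.term()
```

This only shows the transport-level reconnect. It says nothing about application state such as engine registration, which a restarted controller would still have to recover on its own.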
> >>> >
> >>> >
> >>> > On Sun, 2012-02-12 at 12:06 -0800, MinRK wrote:
> >>> > >
> >>> > >
> >>> > > On Sun, Feb 12, 2012 at 11:48, Darren Govoni <darren@ontrenet.com>
> >>> > > wrote:
> >>> > > On Sun, 2012-02-12 at 11:12 -0800, MinRK wrote:
> >>> > > >
> >>> > > >
> >>> > > > On Sun, Feb 12, 2012 at 10:42, Darren Govoni <darren@ontrenet.com>
> >>> > > > wrote:
> >>> > > > Thanks Min,
> >>> > > >
> >>> > > > Is it possible to open a ticket for this capability for a (near)
> >>> > > > future release? It complements the already amazing load balancing
> >>> > > > capability.
> >>> > > >
> >>> > > >
> >>> > > > You are welcome to open an Issue. I don't know if it will make it
> >>> > > > into one of the next few releases, but it is on my todo list. The
> >>> > > > best way to get this sort of thing going is to start with a Pull
> >>> > > > Request.
> >>> > >
> >>> > >
> >>> > > Ok, I will open an issue. Thanks. In the meantime, is it possible
> >>> > > for clients to 'know' when a controller is no longer available? For
> >>> > > example, it would be nice if I can insert a callback handler for
> >>> > > this sort of internal exception so I can provide some graceful
> >>> > > recovery options.
> >>> > >
> >>> > >
> >>> > > It would be sensible to add a heartbeat mechanism on the
> >>> > > controller->client PUB channel for this information. Until then,
> >>> > > your main controller crash detection is going to be simple timeouts.
> >>> > >
> >>> > >
> >>> > > ZeroMQ makes disconnect detection a challenge: there are no
> >>> > > disconnect events, because a disconnected channel is still valid, as
> >>> > > the peer is allowed to just come back up.
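A generic sketch of timeout-based detection, using only the standard library. Nothing here is IPython API: call_with_timeout() and ControllerTimeout are hypothetical names, and `request` stands for any blocking call that would normally wait on a reply from the controller.

```python
import queue
import threading


class ControllerTimeout(Exception):
    """Hypothetical error: no reply arrived within the allowed time."""


def call_with_timeout(request, timeout):
    """Run request() in a daemon thread and wait at most `timeout` seconds.

    A timeout only means "no reply yet". The controller may be dead or just
    slow, so the caller decides whether to retry, alert, or give up.
    """
    replies = queue.Queue(maxsize=1)

    def worker():
        try:
            replies.put((True, request()))
        except Exception as exc:  # hand errors back to the calling thread
            replies.put((False, exc))

    threading.Thread(target=worker, daemon=True).start()
    try:
        ok, value = replies.get(timeout=timeout)
    except queue.Empty:
        raise ControllerTimeout("no reply within %.1f s" % timeout)
    if ok:
        return value
    raise value
```

A recovery callback of the kind asked about above could be hooked in by catching ControllerTimeout around each request. The hung worker thread is left running as a daemon, which is acceptable for a sketch but would need cleanup in a long-lived client.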
> >>> > >
> >>> > >
> >>> > > >
> >>> > > >
> >>> > > > Perhaps a related but separate notion would be the ability to have
> >>> > > > clustered controllers for HA.
> >>> > > >
> >>> > > >
> >>> > > > I do have a model in mind for this sort of thing, though not
> >>> > > > multiple *controllers*, rather multiple Schedulers. Our design
> >>> > > > with 0MQ would make this pretty simple (just start another
> >>> > > > scheduler, and make an extra call to socket.connect() on the
> >>> > > > Client and Engine is all that's needed), and this should allow
> >>> > > > scaling to tens of thousands of engines.
> >>> > >
> >>> > >
> >>> > > Yes! That's what I'm after. In this cloud-scale age of computing,
> >>> > > that would be ideal.
> >>> > >
> >>> > >
> >>> > > Thanks Min.
> >>> > >
> >>> > > >
> >>> > > >
> >>> > > > On Sun, 2012-02-12 at 08:32 -0800, Min RK wrote:
> >>> > > > > No, there is no failover mechanism. When the controller goes
> >>> > > > > down, further requests will simply hang. We have almost all the
> >>> > > > > information we need to bring up a new controller in its place
> >>> > > > > (restart it), in which case the Client wouldn't even need to know
> >>> > > > > that it went down, and would continue to just work, thanks to
> >>> > > > > some zeromq magic.
> >>> > > > >
> >>> > > > > -MinRK
> >>> > > > >
> >>> > > > > On Feb 12, 2012, at 5:02, Darren Govoni <darren@ontrenet.com> wrote:
> >>> > > > >
> >>> > > > > > Hi,
> >>> > > > > > Does ipython support any kind of clustering or failover for
> >>> > > > > > ipcontrollers? I'm wondering how situations are handled where a
> >>> > > > > > controller goes down when a client needs to perform something.
> >>> > > > > >
> >>> > > > > > thanks for any tips.
> >>> > > > > > Darren
> >>> > > > > >
> >>> > > > > >
> >>> > > > > > _______________________________________________
> >>> > > > > > IPython-User mailing list
> >>> > > > > > IPython-User@scipy.org
> >>> > > > > > http://mail.scipy.org/mailman/listinfo/ipython-user