[IPython-User] ipcontroller failover?

MinRK benjaminrk@gmail....
Mon Mar 5 18:38:30 CST 2012


Correct, engines do not reconnect to a new controller, and right now the
controller is a single point of failure.

We absolutely do intend to enable restarting the controller, and it
wouldn't be remotely difficult; the code just isn't written yet.

Steps required for this:

1. persist engine connection state to files/db (the engine ID/UUID mapping
should suffice)
2. when starting up, load this information into the Hub, instead of
starting from scratch

That is all.  No change should be required in the engines or clients, as
zeromq handles the reconnect automagically.
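
As a rough illustration of step 1, persisting the ID/UUID map could be as
small as a JSON file. This is only a sketch; the file name and helper
functions below are hypothetical, not existing IPython API:

import json

REGISTRATION_FILE = "engine_state.json"  # hypothetical location

def save_engine_state(id_to_uuid):
    # id_to_uuid: dict mapping engine ID -> engine UUID (the zmq identity).
    # Note that json turns integer IDs into strings on disk.
    with open(REGISTRATION_FILE, "w") as f:
        json.dump(id_to_uuid, f)

def load_engine_state():
    # A restarted Hub would call something like this to pre-populate its
    # registration table instead of starting from scratch.
    try:
        with open(REGISTRATION_FILE) as f:
            return json.load(f)
    except IOError:
        return {}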

There is already enough information stored in the *task* database to resume
all tasks that were waiting in the Scheduler, but I'm not sure whether this
should be done by default, or only on request.
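
For the on-request case, the pieces already exist on the client side;
something along these lines should work against the task database (a
sketch, assuming the Client's db_query/resubmit methods and that
'completed' is left unset for tasks that never finished):

from IPython.parallel import Client

rc = Client()
# find tasks that were submitted but never completed
# (e.g. tasks lost in a scheduler when the controller died)
pending = rc.db_query({'completed': None}, keys=['msg_id'])
# ask the Hub to resubmit them
rc.resubmit([rec['msg_id'] for rec in pending])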

-MinRK

On Mon, Mar 5, 2012 at 15:17, Darren Govoni <darren@ontrenet.com> wrote:

> Hi,
>
> On Sun, 2012-02-12 at 13:19 -0800, MinRK wrote:
> > It may also be unnecessary, because if the controller comes up at the
> > same endpoint(s), then zeromq handles all the reconnects invisibly.  A
> > connection to an endpoint is always valid, whether or not there is a
> > socket present at any given point in time.
>
>   I tried an example to see this. I ran an ipcontroller on one machine
> with a static --port=21001 so the engine and client connection files
> would always be valid.
>

Just specifying the registration port isn't enough information; you should
be using `--reuse` (or `IPControllerApp.reuse_files=True`) so that the
connection files remain valid across sessions.
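
Concretely, a restart that keeps the connection files valid might look like
this (a sketch; the port is just the one from your example):

# restart with the same registration port, reusing the connection files:
#   ipcontroller --reuse --port=21001

# or set it in ipcontroller_config.py:
c = get_config()  # provided by IPython in config files
c.IPControllerApp.reuse_files = True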


>
> I connected one engine from another server.
>
> I killed the controller and restarted it.
>
> After doing:
>
> client = Client()
> client.ids
> []
>
> There are no longer any engines connected.
>
> dview = client[:]
> ...
> NoEnginesRegistered: Can't build targets without any engines
>
> The problem, perhaps, is that for any large-scale system, say 1 controller
> with 50 engines running on 50 servers, this single point of failure is
> hard to remedy.
>
> Is there a way to tell the controller to reconnect to last known engine
> IP addresses? Or some other way to re-establish the grid? Rebooting 50
> servers is not a good option for us.
>
> On Sun, 2012-02-12 at 13:19 -0800, MinRK wrote:
> >
> >
> > On Sun, Feb 12, 2012 at 13:02, Darren Govoni <darren@ontrenet.com>
> > wrote:
> >         Correct me if I'm wrong, but do the ipengines 'connect' or
> >         otherwise announce their presence to the controller?
> >
> >
> > Yes, 100% of the connections are inbound to the controller processes,
> > from clients and engines alike.  This is a strict requirement, because
> > it would not be acceptable for engines to need open ports for inbound
> > connections.  Simply bringing up a new controller with the same
> > connection information would result in the cluster continuing to
> > function, with the engines and client never realizing the controller
> > went down at all, nor having to act on it in any way.
> >
> >         If it were the other way around, then this would accommodate
> >         some degree of fault tolerance for the controller, because it
> >         could be restarted by a watchdog and then re-establish the
> >         connected state of the cluster. I.e. a controller comes online,
> >         a pub/sub message is sent to a known channel, and clients or
> >         engines add the new ipcontroller to their internal list as a
> >         failover endpoint.
> >
> >
> > This is still possible without reversing connection direction.  Note
> > that in zeromq there is *exactly zero* correlation between
> > communication direction and connection direction.  PUB can connect to
> > SUB, and vice versa.  In fact a single socket can bind and connect at
> > the same time.
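
For illustration, a minimal pyzmq sketch of that property (the endpoints
here are arbitrary placeholders):

import zmq

ctx = zmq.Context.instance()

# The SUB side can be the one that binds...
sub = ctx.socket(zmq.SUB)
sub.setsockopt(zmq.SUBSCRIBE, b"")
sub.bind("tcp://*:5555")

# ...while the PUB side connects: connection direction is independent of
# message direction.
pub = ctx.socket(zmq.PUB)
pub.connect("tcp://127.0.0.1:5555")

# And a single socket can bind and connect at the same time.
pub.bind("tcp://*:5556")
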
> >
> >
> > It may also be unnecessary, because if the controller comes up at the
> > same endpoint(s), then zeromq handles all the reconnects invisibly.  A
> > connection to an endpoint is always valid, whether or not there is a
> > socket present at any given point in time.
> >
> >
> >         On Sun, 2012-02-12 at 12:06 -0800, MinRK wrote:
> >         >
> >         >
> >         > On Sun, Feb 12, 2012 at 11:48, Darren Govoni <darren@ontrenet.com>
> >         > wrote:
> >         >         On Sun, 2012-02-12 at 11:12 -0800, MinRK wrote:
> >         >         >
> >         >         >
> >         >         > On Sun, Feb 12, 2012 at 10:42, Darren Govoni <darren@ontrenet.com>
> >         >         > wrote:
> >         >         >         Thanks Min,
> >         >         >
> >         >         >         Is it possible to open a ticket for this
> >         >         >         capability for a (near) future release? It
> >         >         >         complements that already amazing load
> >         >         >         balancing capability.
> >         >         >
> >         >         >
> >         >         > You are welcome to open an Issue.  I don't know if it
> >         >         > will make it into one of the next few releases, but it
> >         >         > is on my todo list.  The best way to get this sort of
> >         >         > thing going is to start with a Pull Request.
> >         >
> >         >
> >         >         Ok, I will open an issue. Thanks. In the meantime, is it
> >         >         possible for clients to 'know' when a controller is no
> >         >         longer available? For example, it would be nice if I can
> >         >         insert a callback handler for this sort of internal
> >         >         exception so I can provide some graceful recovery options.
> >         >
> >         >
> >         > It would be sensible to add a heartbeat mechanism on the
> >         > controller->client PUB channel for this information.  Until then,
> >         > your main controller crash detection is going to be simple
> >         > timeouts.
> >         >
> >         >
> >         > ZeroMQ makes disconnect detection a challenge (because there are
> >         > no disconnect events, because a disconnected channel is still
> >         > valid, as the peer is allowed to just come back up).
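
Until such a heartbeat exists, timeout-based detection from the client side
is straightforward (a sketch using AsyncResult.get; the 10-second timeout is
an arbitrary choice):

from IPython.parallel import Client
from IPython.parallel.error import TimeoutError

rc = Client()
view = rc.load_balanced_view()
ar = view.apply_async(lambda: 'ping')
try:
    ar.get(timeout=10)  # if the controller is gone, no reply ever arrives
except TimeoutError:
    print("no reply within 10s -- controller may be down")
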
> >         >
> >         >
> >         >         >
> >         >         >
> >         >         >         Perhaps a related but separate notion would be
> >         >         >         the ability to have clustered controllers for HA.
> >         >         >
> >         >         >
> >         >         > I do have a model in mind for this sort of thing,
> >         >         > though not multiple *controllers*, rather multiple
> >         >         > Schedulers.  Our design with 0MQ would make this pretty
> >         >         > simple (just start another scheduler, and make an extra
> >         >         > call to socket.connect() on the Client and Engine is all
> >         >         > that's needed), and this should allow scaling to tens of
> >         >         > thousands of engines.
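
In pyzmq terms, the client-side change would be roughly this (a sketch; the
endpoints and the DEALER socket type are my assumptions about the client's
task connection, not a statement of the actual implementation):

import zmq

ctx = zmq.Context.instance()
task = ctx.socket(zmq.DEALER)            # the client's task socket
task.connect("tcp://scheduler-a:10101")  # existing scheduler (hypothetical endpoint)
task.connect("tcp://scheduler-b:10101")  # one extra connect() per added scheduler
# zeromq round-robins outgoing requests across all connected peers
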
> >         >
> >         >
> >         >         Yes! That's what I'm after. In this cloud-scale age of
> >         >         computing, that would be ideal.
> >         >
> >         >
> >         >         Thanks Min.
> >         >
> >         >         >
> >         >         >
> >         >         >         On Sun, 2012-02-12 at 08:32 -0800, Min RK wrote:
> >         >         >         > No, there is no failover mechanism.  When the
> >         >         >         > controller goes down, further requests will
> >         >         >         > simply hang.  We have almost all the information
> >         >         >         > we need to bring up a new controller in its
> >         >         >         > place (restart it), in which case the Client
> >         >         >         > wouldn't even need to know that it went down,
> >         >         >         > and would continue to just work, thanks to some
> >         >         >         > zeromq magic.
> >         >         >         >
> >         >         >         > -MinRK
> >         >         >         >
> >         >         >         > On Feb 12, 2012, at 5:02, Darren Govoni <darren@ontrenet.com> wrote:
> >         >         >         >
> >         >         >         > > Hi,
> >         >         >         > >  Does ipython support any kind of clustering
> >         >         >         > > or failover for ipcontrollers? I'm wondering
> >         >         >         > > how situations are handled where a controller
> >         >         >         > > goes down when a client needs to perform
> >         >         >         > > something.
> >         >         >         > >
> >         >         >         > > thanks for any tips.
> >         >         >         > > Darren