[IPython-User] ipcontroller failover?

MinRK benjaminrk@gmail....
Sun Feb 12 15:19:55 CST 2012


On Sun, Feb 12, 2012 at 13:02, Darren Govoni <darren@ontrenet.com> wrote:

> Correct me if I'm wrong, but do the ipengines 'connect' or otherwise
> announce their presence to the controller?


Yes, 100% of the connections are inbound to the controller processes, from
clients and engines alike.  This is a strict requirement, because it would
not be acceptable for engines to need open ports for inbound connections.
Simply bringing up a new controller with the same connection information
would let the cluster continue to function, with the engines and client
never realizing the controller went down, and never having to act on it in
any way.
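A minimal sketch of that restart transparency with plain pyzmq REQ/REP sockets standing in for the real client and controller (the endpoint, payloads, and sleep interval are invented for illustration):

```python
import time
import zmq

ctx = zmq.Context.instance()

controller = ctx.socket(zmq.REP)               # stands in for the controller
port = controller.bind_to_random_port("tcp://127.0.0.1")
url = "tcp://127.0.0.1:%i" % port

client = ctx.socket(zmq.REQ)                   # stands in for the client
client.connect(url)

client.send(b"req1")
controller.recv()
controller.send(b"ok1")
r1 = client.recv()
print(r1)                                      # b'ok1'

controller.close()                             # the controller "goes down"...
time.sleep(0.3)                                # ...the client notices the drop
controller = ctx.socket(zmq.REP)
controller.bind(url)                           # ...and is restarted in place

client.send(b"req2")                           # the client is none the wiser
controller.recv()
controller.send(b"ok2")
r2 = client.recv()
print(r2)                                      # b'ok2'
```

The client never reconnects explicitly; zeromq re-establishes the transport underneath the same socket.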


> If it were the other way
> around, then this would accommodate some degree of fault tolerance for
> the controller because it could be restarted by a watchdog and then
> re-establish the connected state of the cluster. i.e. a controller comes
> online, a pub/sub message is sent to a known channel, and clients or
> engines add the new ipcontroller to their internal lists as a failover
> endpoint.
>

This is still possible without reversing connection direction.  Note that
in zeromq there is *exactly zero* correlation between communication
direction and connection direction.  PUB can connect to SUB, and vice
versa.  In fact a single socket can bind and connect at the same time.
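For example, a pyzmq sketch that assumes nothing about IPython's actual sockets: the SUB side binds and the PUB side connects, and messages still flow PUB -> SUB:

```python
import zmq

ctx = zmq.Context.instance()

sub = ctx.socket(zmq.SUB)
sub.setsockopt(zmq.SUBSCRIBE, b"")                 # subscribe to everything
port = sub.bind_to_random_port("tcp://127.0.0.1")  # the *receiver* binds...

pub = ctx.socket(zmq.PUB)
pub.connect("tcp://127.0.0.1:%i" % port)           # ...the *sender* connects
# (a single socket may also bind() *and* connect() at the same time)

# PUB/SUB joins are asynchronous, so retry until the message gets through
msg = None
for _ in range(50):
    pub.send(b"hello")
    if sub.poll(100):                              # wait up to 100 ms
        msg = sub.recv()
        break
print(msg)                                         # b'hello'
```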

It may also be unnecessary, because if the controller comes up at the same
endpoint(s), then zeromq handles all the reconnects invisibly.  A
connection to an endpoint is always valid, whether or not there is a socket
present at any given point in time.


>
> On Sun, 2012-02-12 at 12:06 -0800, MinRK wrote:
> >
> >
> > On Sun, Feb 12, 2012 at 11:48, Darren Govoni <darren@ontrenet.com>
> > wrote:
> >         On Sun, 2012-02-12 at 11:12 -0800, MinRK wrote:
> >         >
> >         >
> >         > On Sun, Feb 12, 2012 at 10:42, Darren Govoni
> >         <darren@ontrenet.com>
> >         > wrote:
> >         >         Thanks Min,
> >         >
> >         >         Is it possible to open a ticket for this capability
> >         for a
> >         >         (near) future
> >         >         release? It complements that already amazing load
> >         balancing
> >         >         capability.
> >         >
> >         >
> >         > You are welcome to open an Issue.  I don't know if it will
> >         make it
> >         > into one of the next few releases, but it is on my todo
> >         list.  The
> >         > best way to get this sort of thing going is to start with a
> >         Pull
> >         > Request.
> >
> >
> >         Ok, I will open an issue. Thanks. In the meantime, is it
> >         possible for
> >         clients to 'know' when a controller is no longer available?
> >         For example,
> >         it would be nice if I can insert a callback handler for this
> >         sort of
> >         internal exception so I can provide some graceful recovery
> >         options.
> >
> >
> > It would be sensible to add a heartbeat mechanism on the
> > controller->client PUB channel for this information.  Until then, your
> > main means of detecting a controller crash is a simple timeout.
> >
> >
> > ZeroMQ makes disconnect detection a challenge: there are no disconnect
> > events, because a disconnected channel is still valid; the peer is
> > allowed to just come back up.
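A minimal sketch of such timeout-based detection with pyzmq; the heartbeat channel, payloads, and intervals are invented for illustration, not IPython's actual protocol:

```python
import time
import zmq

ctx = zmq.Context.instance()

hb_pub = ctx.socket(zmq.PUB)                   # stands in for the controller
port = hb_pub.bind_to_random_port("tcp://127.0.0.1")

hb_sub = ctx.socket(zmq.SUB)                   # stands in for the client
hb_sub.setsockopt(zmq.SUBSCRIBE, b"")
hb_sub.connect("tcp://127.0.0.1:%i" % port)

def controller_alive(sub, timeout_ms=500):
    """Declare the controller down if no heartbeat arrives in time."""
    if sub.poll(timeout_ms):
        sub.recv()
        return True
    return False

# while the "controller" is beating, the check passes
alive_before = False
for _ in range(20):
    hb_pub.send(b"beat")
    if controller_alive(hb_sub, 100):
        alive_before = True
        break

time.sleep(0.2)                                # let any in-flight beats land
while hb_sub.poll(0):
    hb_sub.recv()                              # drain them

# once the beats stop, the timeout fires
alive_after = controller_alive(hb_sub, 300)
print(alive_before, alive_after)               # True False
```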
> >
> >
> >         >
> >         >
> >         >         Perhaps a related but separate notion would be the
> >         ability to
> >         >         have
> >         >         clustered controllers for HA.
> >         >
> >         >
> >         > I do have a model in mind for this sort of thing, though
> >         > not multiple *controllers*, rather multiple Schedulers.
> >         > Our design with 0MQ would make this pretty simple: starting
> >         > another scheduler, plus an extra call to socket.connect()
> >         > on the Client and Engine, is all that's needed, and this
> >         > should allow scaling to tens of thousands of engines.
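A sketch of that idea with plain pyzmq sockets standing in for the schedulers; the fan-out comes from zeromq's own round-robin over connected peers (names and ports are illustrative):

```python
import time
import zmq

ctx = zmq.Context.instance()

# two stand-in "schedulers" (plain REP sockets for illustration)
scheds, ports = [], []
for _ in range(2):
    s = ctx.socket(zmq.REP)
    ports.append(s.bind_to_random_port("tcp://127.0.0.1"))
    scheds.append(s)

client = ctx.socket(zmq.REQ)
for p in ports:                                # one extra connect() per
    client.connect("tcp://127.0.0.1:%i" % p)   # additional scheduler
time.sleep(0.2)                                # let both connections settle

served = []
for _ in range(2):
    client.send(b"task")
    for j, s in enumerate(scheds):             # answer from whichever
        if s.poll(1000):                       # scheduler got the request
            s.recv()
            s.send(b"done")
            served.append(j)
            break
    client.recv()
print(sorted(served))                          # both schedulers were used
```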
> >
> >
> >         Yes! That's what I'm after. In this cloud-scale age of
> >         computing, that
> >         would be ideal.
> >
> >
> >         Thanks Min.
> >
> >         >
> >         >
> >         >         On Sun, 2012-02-12 at 08:32 -0800, Min RK wrote:
> >         >         > No, there is no failover mechanism.  When the
> >         controller
> >         >         goes down, further requests will simply hang.  We
> >         have almost
> >         >         all the information we need to bring up a new
> >         controller in
> >         >         its place (restart it), in which case the Client
> >         wouldn't even
> >         >         need to know that it went down, and would continue
> >         to just
> >         >         work, thanks to some zeromq magic.
> >         >         >
> >         >         > -MinRK
> >         >         >
> >         >         > On Feb 12, 2012, at 5:02, Darren Govoni
> >         >         <darren@ontrenet.com> wrote:
> >         >         >
> >         >         > > Hi,
> >         >         > >  Does ipython support any kind of clustering or
> >         failover
> >         >         for
> >         >         > > ipcontrollers? I'm wondering how situations are
> >         handled
> >         >         where a
> >         >         > > controller goes down when a client needs to
> >         perform
> >         >         something.
> >         >         > >
> >         >         > > thanks for any tips.
> >         >         > > Darren
> >         >         > >
> >         >         > > _______________________________________________
> >         >         > > IPython-User mailing list
> >         >         > > IPython-User@scipy.org
> >         >         > >
> >         http://mail.scipy.org/mailman/listinfo/ipython-user