[IPython-User] ipcontroller failover?
Darren Govoni
darren@ontrenet....
Mon Mar 5 17:17:05 CST 2012
Hi,
On Sun, 2012-02-12 at 13:19 -0800, MinRK wrote:
> It may also be unnecessary, because if the controller comes up at the
> same endpoint(s), then zeromq handles all the reconnects invisibly. A
> connection to an endpoint is always valid, whether or not there is a
> socket present at any given point in time.
I tried an example to see this. I ran an ipcontroller on one machine
with static --port=21001 so engine client files would always be valid.
I connected one engine from another server.
I killed the controller and restarted it.
After the restart, from a client:

    client = Client()
    client.ids
    []

There are no longer any engines connected, and building a view fails:

    dview = client[:]
    ...
    NoEnginesRegistered: Can't build targets without any engines
The problem perhaps is that for any large scale system, say 1 controller
with 50 engines running on 50 servers, this single-point-of-failure is
hard to remedy.
Is there a way to tell the controller to reconnect to last known engine
IP addresses? Or some other way to re-establish the grid? Rebooting 50
servers is not a good option for us.
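
In the meantime, a client-side guard at least avoids hitting NoEnginesRegistered while checking whether any engines come back. This is a minimal sketch, not anything IPython provides: `wait_for_engines` and its parameters are hypothetical names, and whether engines re-register after a controller restart is exactly the open question above, so the helper only polls and gives up after a timeout.

```python
import time


def wait_for_engines(get_ids, minimum=1, timeout=30.0, interval=1.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll get_ids() until it reports at least `minimum` engine ids.

    get_ids is any zero-argument callable returning the current engine ids
    (hypothetical hook; with IPython.parallel it could be lambda: client.ids).
    Returns the list of ids, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = clock() + timeout
    while True:
        ids = list(get_ids())
        if len(ids) >= minimum:
            return ids
        if clock() >= deadline:
            raise TimeoutError(
                "only %d engine(s) registered after %.1fs" % (len(ids), timeout)
            )
        sleep(interval)


# Intended use (assumes a reachable controller, so it is left commented out):
#   from IPython.parallel import Client
#   client = Client()
#   wait_for_engines(lambda: client.ids, minimum=1, timeout=60)
#   dview = client[:]
```

The clock and sleep functions are parameters only so the helper can be exercised without real delays.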
On Sun, 2012-02-12 at 13:19 -0800, MinRK wrote:
>
>
> On Sun, Feb 12, 2012 at 13:02, Darren Govoni <darren@ontrenet.com> wrote:
> Correct me if I'm wrong, but do the ipengines 'connect' or otherwise
> announce their presence to the controller?
>
>
> Yes, 100% of the connections are inbound to the controller processes,
> from clients and engines alike. This is a strict requirement, because
> it would not be acceptable for engines to need open ports for inbound
> connections. Simply bringing up a new controller with the same
> connection information would result in the cluster continuing to
> function, with the engines and client never realizing the controller
> went down at all, nor having to act on it in any way.
>
> If it were the other way around, then this would accommodate some
> degree of fault tolerance for the controller, because it could be
> restarted by a watchdog and then re-establish the connected state of
> the cluster, i.e. a controller comes online, a pub/sub message is sent
> to a known channel, and clients or engines add the new ipcontroller to
> their internal list as a failover endpoint.
>
>
> This is still possible without reversing connection direction. Note
> that in zeromq there is *exactly zero* correlation between
> communication direction and connection direction. PUB can connect to
> SUB, and vice versa. In fact a single socket can bind and connect at
> the same time.
>
>
> It may also be unnecessary, because if the controller comes up at the
> same endpoint(s), then zeromq handles all the reconnects invisibly. A
> connection to an endpoint is always valid, whether or not there is a
> socket present at any given point in time.
>
>
> On Sun, 2012-02-12 at 12:06 -0800, MinRK wrote:
> >
> >
> > On Sun, Feb 12, 2012 at 11:48, Darren Govoni <darren@ontrenet.com> wrote:
> > On Sun, 2012-02-12 at 11:12 -0800, MinRK wrote:
> > >
> > >
> > > On Sun, Feb 12, 2012 at 10:42, Darren Govoni <darren@ontrenet.com> wrote:
> > > Thanks Min,
> > >
> > > Is it possible to open a ticket for this capability for a (near)
> > > future release? It complements the already amazing load balancing
> > > capability.
> > >
> > >
> > > You are welcome to open an Issue. I don't know if it will make it
> > > into one of the next few releases, but it is on my todo list. The
> > > best way to get this sort of thing going is to start with a Pull
> > > Request.
> >
> >
> > Ok, I will open an issue. Thanks. In the meantime, is it possible for
> > clients to 'know' when a controller is no longer available? For
> > example, it would be nice if I could insert a callback handler for
> > this sort of internal exception so I can provide some graceful
> > recovery options.
> >
> >
> > It would be sensible to add a heartbeat mechanism on the
> > controller->client PUB channel for this information. Until then, your
> > main controller crash detection is going to be simple timeouts.
> >
> >
> > ZeroMQ makes disconnect detection a challenge (because there are no
> > disconnect events, because a disconnected channel is still valid, as
> > the peer is allowed to just come back up).
> >
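
The timeout approach described above could be sketched on the client side roughly as follows. This is an illustration only: `HeartbeatMonitor` is a hypothetical helper, not an IPython or ZeroMQ API, and it assumes the application calls `beat()` whenever any message from the controller arrives (a task reply today, or a heartbeat if one is ever added to the PUB channel).

```python
import time


class HeartbeatMonitor:
    """Remember when the controller was last heard from and flag a timeout."""

    def __init__(self, timeout=5.0, clock=time.monotonic):
        self.timeout = timeout        # seconds of silence tolerated
        self._clock = clock           # injectable so tests need not sleep
        self._last_beat = clock()     # treat construction as the first beat

    def beat(self):
        """Record that a message from the controller was just received."""
        self._last_beat = self._clock()

    def silence(self):
        """Seconds elapsed since the last recorded message."""
        return self._clock() - self._last_beat

    def is_alive(self):
        """True while the silence has not exceeded the timeout."""
        return self.silence() <= self.timeout


# Intended use (hypothetical wiring, left commented out):
#   monitor = HeartbeatMonitor(timeout=10.0)
#   ... call monitor.beat() from whatever code receives controller messages ...
#   if not monitor.is_alive():
#       handle_controller_down()   # application-defined recovery callback
```

Because a restarted controller is allowed to come back on the same endpoint, a timeout here only means "silent for too long", not "gone for good", so the recovery callback should be safe to run even if the controller reappears.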
> >
> > >
> > >
> > > Perhaps a related but separate notion would be the ability to have
> > > clustered controllers for HA.
> > >
> > >
> > > I do have a model in mind for this sort of thing, though not multiple
> > > *controllers*, rather multiple Schedulers. Our design with 0MQ would
> > > make this pretty simple (just start another scheduler, and make an
> > > extra call to socket.connect() on the Client and Engine is all that's
> > > needed), and this should allow scaling to tens of thousands of
> > > engines.
> >
> >
> > Yes! That's what I'm after. In this cloud-scale age of computing, that
> > would be ideal.
> >
> >
> > Thanks Min.
> >
> > >
> > >
> > > On Sun, 2012-02-12 at 08:32 -0800, Min RK wrote:
> > > > No, there is no failover mechanism. When the controller goes down,
> > > > further requests will simply hang. We have almost all the
> > > > information we need to bring up a new controller in its place
> > > > (restart it), in which case the Client wouldn't even need to know
> > > > that it went down, and would continue to just work, thanks to some
> > > > zeromq magic.
> > > >
> > > > -MinRK
> > > >
> > > > On Feb 12, 2012, at 5:02, Darren Govoni <darren@ontrenet.com> wrote:
> > > >
> > > > > Hi,
> > > > > Does ipython support any kind of clustering or failover for
> > > > > ipcontrollers? I'm wondering how situations are handled where a
> > > > > controller goes down when a client needs to perform something.
> > > > >
> > > > > thanks for any tips.
> > > > > Darren
> > > > >
> > > > >
> > > > > _______________________________________________
> > > > > IPython-User mailing list
> > > > > IPython-User@scipy.org
> > > > > http://mail.scipy.org/mailman/listinfo/ipython-user