[IPython-User] ipcontroller failover?

Darren Govoni darren@ontrenet....
Mon Mar 5 17:17:05 CST 2012


Hi,

On Sun, 2012-02-12 at 13:19 -0800, MinRK wrote:
> It may also be unnecessary, because if the controller comes up at the
> same endpoint(s), then zeromq handles all the reconnects invisibly.  A
> connection to an endpoint is always valid, whether or not there is a
> socket present at any given point in time.

  I tried an example to see this. I ran an ipcontroller on one machine
with a static --port=21001 so the engine and client connection files
would always be valid.

I connected one engine from another server.

I killed the controller and restarted it.

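After restarting the controller, I did:

    >>> from IPython.parallel import Client
    >>> client = Client()
    >>> client.ids
    []

There are no longer any engines connected, so building a view fails:

    >>> dview = client[:]
    ...
    NoEnginesRegistered: Can't build targets without any engines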
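
One way to guard against the NoEnginesRegistered error is to wait until
engines show up before building a view. This is only a minimal sketch:
wait_for_engines is a hypothetical helper (not part of IPython) that polls
any callable returning the current engine ids, and the timeout values are
arbitrary.

```python
import time


def wait_for_engines(get_ids, expected=1, timeout=30.0, interval=0.5):
    """Poll get_ids() until at least `expected` engine ids are reported.

    get_ids is any zero-argument callable returning a list of ids
    (hypothetical usage: ``lambda: client.ids``).
    Returns the list of ids, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while True:
        ids = list(get_ids())
        if len(ids) >= expected:
            return ids
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "only %d of %d engines registered after %.1fs"
                % (len(ids), expected, timeout))
        time.sleep(interval)  # back off before asking again
```

With a connected client this might be called as
wait_for_engines(lambda: client.ids, expected=50) before doing client[:].
It does not make engines re-register; it only turns an immediate failure
into a bounded wait.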

The problem, perhaps, is that for any large-scale system, say 1 controller
with 50 engines running on 50 servers, this single point of failure is
hard to remedy.

Is there a way to tell the controller to reconnect to last known engine
IP addresses? Or some other way to re-establish the grid? Rebooting 50
servers is not a good option for us.
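
For reference, the watchdog idea from the quoted thread below could look
roughly like this. It is a sketch under assumptions: run_with_restart is a
hypothetical helper, and the restart count and delay are arbitrary. It only
restarts the controller process at the same endpoint; as the test above
shows, that alone did not bring the engines back in my setup.

```python
import subprocess
import time


def run_with_restart(cmd, max_restarts=5, delay=1.0):
    """Run cmd, restarting it each time it exits, up to max_restarts times.

    Returns the list of exit codes observed, one per run.
    """
    exit_codes = []
    for attempt in range(max_restarts + 1):
        proc = subprocess.Popen(cmd)
        exit_codes.append(proc.wait())  # block until the process exits
        if attempt < max_restarts:
            time.sleep(delay)           # brief pause before restarting
    return exit_codes
```

For the setup described here the call would be something like
run_with_restart(["ipcontroller", "--port=21001"]).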

On Sun, 2012-02-12 at 13:19 -0800, MinRK wrote:
> 
> 
> On Sun, Feb 12, 2012 at 13:02, Darren Govoni <darren@ontrenet.com>
> wrote:
>         Correct me if I'm wrong, but do the ipengines 'connect' or
>         otherwise announce their presence to the controller?
> 
> 
> Yes, 100% of the connections are inbound to the controller processes,
> from clients and engines alike.  This is a strict requirement, because
> it would not be acceptable for engines to need open ports for inbound
> connections.  Simply bringing up a new controller with the same
> connection information would result in the cluster continuing to
> function, with the engines and client never realizing the controller
> went down at all, nor having to act on it in any way.
>  
>         If it were the other way around, then this would accommodate
>         some degree of fault tolerance for the controller, because it
>         could be restarted by a watchdog and then re-establish the
>         connected state of the cluster, i.e. a controller comes online,
>         a pub/sub message is sent to a known channel, and clients or
>         engines add the new ipcontroller to their internal list as a
>         failover endpoint.
> 
> 
> This is still possible without reversing connection direction.  Note
> that in zeromq there is *exactly zero* correlation between
> communication direction and connection direction.  PUB can connect to
> SUB, and vice versa.  In fact a single socket can bind and connect at
> the same time.
> 
> 
> It may also be unnecessary, because if the controller comes up at the
> same endpoint(s), then zeromq handles all the reconnects invisibly.  A
> connection to an endpoint is always valid, whether or not there is a
> socket present at any given point in time.
>  
>         
>         On Sun, 2012-02-12 at 12:06 -0800, MinRK wrote:
>         >
>         >
>         > On Sun, Feb 12, 2012 at 11:48, Darren Govoni <darren@ontrenet.com> wrote:
>         >         On Sun, 2012-02-12 at 11:12 -0800, MinRK wrote:
>         >         >
>         >         >
>         >         > On Sun, Feb 12, 2012 at 10:42, Darren Govoni <darren@ontrenet.com> wrote:
>         >         >         Thanks Min,
>         >         >
>         >         >         Is it possible to open a ticket for this capability for a
>         >         >         (near) future release? It complements the already amazing
>         >         >         load balancing capability.
>         >         >
>         >         >
>         >         > You are welcome to open an Issue.  I don't know if it will make it
>         >         > into one of the next few releases, but it is on my todo list.  The
>         >         > best way to get this sort of thing going is to start with a Pull
>         >         > Request.
>         >
>         >
>         >         Ok, I will open an issue. Thanks. In the meantime, is it possible for
>         >         clients to 'know' when a controller is no longer available? For example,
>         >         it would be nice if I can insert a callback handler for this sort of
>         >         internal exception so I can provide some graceful recovery options.
>         >
>         >
>         > It would be sensible to add a heartbeat mechanism on the
>         > controller->client PUB channel for this information.  Until then, your
>         > main controller crash detection is going to be simple timeouts.
>         >
>         >
>         > ZeroMQ makes disconnect detection a challenge (because there are no
>         > disconnect events, because a disconnected channel is still valid, as
>         > the peer is allowed to just come back up).
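
The timeout-based detection described above can be sketched as follows.
This is an illustration only: LivenessTimer is a hypothetical class, not
part of IPython or zeromq, and it assumes the client calls saw_message()
whenever any reply or notification arrives from the controller.

```python
import time


class LivenessTimer:
    """Consider a peer dead if nothing was heard from it for `timeout` seconds."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self._clock = clock          # injectable clock, handy for testing
        self._last_seen = clock()

    def saw_message(self):
        """Record that a message from the peer just arrived."""
        self._last_seen = self._clock()

    def is_alive(self):
        """True while the silence is still shorter than the timeout."""
        return (self._clock() - self._last_seen) < self.timeout
```

Since zeromq reports no disconnect event, a silent but healthy controller
and a crashed one look the same to this timer; the timeout has to be chosen
longer than the longest expected quiet period.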
>         >
>         >
>         >         >
>         >         >
>         >         >         Perhaps a related but separate notion would be the ability to
>         >         >         have clustered controllers for HA.
>         >         >
>         >         >
>         >         > I do have a model in mind for this sort of thing, though not multiple
>         >         > *controllers*, rather multiple Schedulers.  Our design with 0MQ would
>         >         > make this pretty simple (just start another scheduler, and make an
>         >         > extra call to socket.connect() on the Client and Engine is all that's
>         >         > needed), and this should allow scaling to tens of thousands of
>         >         > engines.
>         >
>         >
>         >         Yes! That's what I'm after. In this cloud-scale age of computing, that
>         >         would be ideal.
>         >
>         >
>         >         Thanks Min.
>         >
>         >         >
>         >         >
>         >         >         On Sun, 2012-02-12 at 08:32 -0800, Min RK wrote:
>         >         >         > No, there is no failover mechanism.  When the controller
>         >         >         > goes down, further requests will simply hang.  We have almost
>         >         >         > all the information we need to bring up a new controller in
>         >         >         > its place (restart it), in which case the Client wouldn't even
>         >         >         > need to know that it went down, and would continue to just
>         >         >         > work, thanks to some zeromq magic.
>         >         >         >
>         >         >         > -MinRK
>         >         >         >
>         >         >         > On Feb 12, 2012, at 5:02, Darren Govoni <darren@ontrenet.com> wrote:
>         >         >         >
>         >         >         > > Hi,
>         >         >         > >  Does ipython support any kind of clustering or failover for
>         >         >         > > ipcontrollers? I'm wondering how situations are handled where a
>         >         >         > > controller goes down when a client needs to perform something.
>         >         >         > >
>         >         >         > > thanks for any tips.
>         >         >         > > Darren
>         >         >         > >
>         >         >         > >
>         _______________________________________________
>         >         >         > > IPython-User mailing list
>         >         >         > > IPython-User@scipy.org
>         >         >         > >
>         >         http://mail.scipy.org/mailman/listinfo/ipython-user



