[IPython-User] ipcontroller failover?

darren@ontrene...
Tue Mar 6 16:23:48 CST 2012


Sure.

We're developing a cloud-based software system that processes documents,
files, etc. It is currently built around Amazon cloud APIs, but we want to
lose that dependency. We have a single portal server that provides the user
experience and acts as "controller" of the other servers. From the portal,
up to 100 virtual servers can be launched and then assigned to queues.

Work messages are sent to the queues and each server fetches one message.
Amazon has a queue service for this, but it's not as fast as ipython. It
is, however, fault tolerant.

We want to move that queueing in-house to ipython, with its nice load
balancing features. The portal server will house the controller, and each
worker server (up to 100 of them) will run one or more engines connected to
it.
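
For reference, the submission pattern we have in mind looks roughly like
this. It is just a sketch using IPython.parallel's load-balanced view;
process_document and the documents list are placeholders for our own code:

from IPython.parallel import Client

def process_document(doc):
    # stand-in for our real per-document work
    return len(doc)

documents = ["doc one", "doc two", "doc three"]

client = Client()                    # reads the ipcontroller-client.json connection file
lview = client.load_balanced_view()  # tasks go to whichever engine is free
amr = lview.map_async(process_document, documents)
print(amr.get())                     # blocks until all documents are processed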

One aspect of our design is that it must accommodate hardware failures.
Currently any of the worker servers can just "disappear" without affecting
the outcome. Likewise, new ones can emerge and help with the work.

The portal server can also be rebooted or relaunched if necessary because
all the cloud data is "in the cloud".

Since our portal acts as the "head" of the system, if it runs ipcontroller
and is rebooted (which is allowed), then all 100 servers get confused and
won't know to reconnect. I can write some logic to force this, but it seems
easier for ipcontroller to remember its own state. Glad to see it's a coming
feature!
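
For what it's worth, based on the --reuse note further down the thread, I'm
assuming the portal should be starting the controller along these lines (the
IP and port here are just examples; exact flags per ipcontroller --help-all):

# on the portal server: keep the same ports and reuse the connection files
# across restarts, so the engine/client JSON files stay valid
ipcontroller --reuse --ip=0.0.0.0 --port=21001

or, equivalently, setting IPControllerApp.reuse_files=True in the controller
config.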

> Can I ask more about what your environment is like, and the typical
> circumstances of controller shutdown / crash?
>
> How often does the controller die, how many tasks are pending in the
> Schedulers, and how many are active on engines when this happens?  What are
> your expectations/hopes/dreams for behavior if the controller goes down
> while a bunch of work is in-flight?
>
> -MinRK
>
> On Tue, Mar 6, 2012 at 13:20, <darren@ontrenet.com> wrote:
>
>> Wow. Awesome. Let me try it. Many thanks.
>>
>> > You might check out this first-go implementation:
>> >
>> > https://github.com/ipython/ipython/pull/1471
>> >
>> > It seems to work fine if the cluster was idle at controller crash, but I
>> > haven't tested the behavior of running jobs.  I'm certain that the
>> > propagation of results of jobs submitted before shutdown all the way up to
>> > interactive Clients is broken, but the results should still arrive in the
>> > Hub's db.
>> >
>> > -MinRK
>> >
>> >
>> > On Mon, Mar 5, 2012 at 16:38, MinRK <benjaminrk@gmail.com> wrote:
>> >
>> >> Correct, engines do not reconnect to a new controller, and right now a
>> >> Controller is a single point of failure.
>> >>
>> >> We absolutely do intend to enable restarting the controller, and it
>> >> wouldn't be remotely difficult, the code just isn't written yet.
>> >>
>> >> Steps required for this:
>> >>
>> >> 1. persist engine connection state to files/db (the engine ID/UUID
>> >> mapping should suffice)
>> >> 2. when starting up, load this information into the Hub, instead of
>> >> starting from scratch
>> >>
>> >> That is all.  No change should be required in the engines or clients, as
>> >> zeromq handles the reconnect automagically.
>> >>
>> >> There is already enough information stored in the *task* database to
>> >> resume all tasks that were waiting in the Scheduler, but I'm not sure
>> >> whether this should be done by default, or only on request.
>> >>
>> >> -MinRK
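
Just to check my understanding of steps 1 and 2 above, I picture something
roughly like this on the controller side -- purely a sketch with made-up
helper names, not actual IPython code:

import json

def save_engine_state(engines, path="engine_state.json"):
    # engines: dict mapping engine id -> connection UUID/identity
    with open(path, "w") as f:
        json.dump(engines, f)

def load_engine_state(path="engine_state.json"):
    # on controller startup, reload the saved id/UUID mapping so the Hub
    # can treat reconnecting engines as already registered, not brand new
    try:
        with open(path) as f:
            return json.load(f)
    except IOError:
        return {}

i.e. write the registration table out whenever it changes, and read it back
in before the Hub starts accepting connections.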
>> >>
>> >> On Mon, Mar 5, 2012 at 15:17, Darren Govoni <darren@ontrenet.com> wrote:
>> >>
>> >>> Hi,
>> >>>
>> >>> On Sun, 2012-02-12 at 13:19 -0800, MinRK wrote:
>> >>> > It may also be unnecessary, because if the controller comes up at the
>> >>> > same endpoint(s), then zeromq handles all the reconnects invisibly.  A
>> >>> > connection to an endpoint is always valid, whether or not there is a
>> >>> > socket present at any given point in time.
>> >>>
>> >>>   I tried an example to see this. I ran an ipcontroller on one machine
>> >>> with static --port=21001 so engine client files would always be valid.
>> >>>
>> >>
>> >> Just specifying the registration port isn't enough information, and you
>> >> should be using `--reuse` or `IPControllerApp.reuse_files=True` for
>> >> connection files to remain valid across sessions.
>> >>
>> >>
>> >>>
>> >>> I connected one engine from another server.
>> >>>
>> >>> I killed the controller and restarted it.
>> >>>
>> >>> After doing:
>> >>>
>> >>> client = Client()
>> >>> client.ids
>> >>> []
>> >>>
>> >>> There are no longer any engines connected.
>> >>>
>> >>> dview = client[:]
>> >>> ...
>> >>> NoEnginesRegistered: Can't build targets without any engines
>> >>>
>> >>> The problem perhaps is that for any large scale system, say 1 controller
>> >>> with 50 engines running on 50 servers, this single point of failure is
>> >>> hard to remedy.
>> >>>
>> >>> Is there a way to tell the controller to reconnect to last known engine
>> >>> IP addresses? Or some other way to re-establish the grid? Rebooting 50
>> >>> servers is not a good option for us.
>> >>>
>> >>> On Sun, 2012-02-12 at 13:19 -0800, MinRK wrote:
>> >>> >
>> >>> >
>> >>> > On Sun, Feb 12, 2012 at 13:02, Darren Govoni <darren@ontrenet.com> wrote:
>> >>> >         Correct me if I'm wrong, but do the ipengines 'connect' or
>> >>> >         otherwise announce their presence to the controller?
>> >>> >
>> >>> >
>> >>> > Yes, 100% of the connections are inbound to the controller processes,
>> >>> > from clients and engines alike.  This is a strict requirement, because
>> >>> > it would not be acceptable for engines to need open ports for inbound
>> >>> > connections.  Simply bringing up a new controller with the same
>> >>> > connection information would result in the cluster continuing to
>> >>> > function, with the engines and client never realizing the controller
>> >>> > went down at all, nor having to act on it in any way.
>> >>> >
>> >>> >         If it were the other way around, then this would accommodate
>> >>> >         some degree of fault tolerance for the controller, because it
>> >>> >         could be restarted by a watchdog and then re-establish the
>> >>> >         connected state of the cluster. i.e. a controller comes
>> >>> >         online, a pub/sub message is sent to a known channel, and
>> >>> >         clients or engines add the new ipcontroller to their internal
>> >>> >         list as a failover endpoint.
>> >>> >
>> >>> >
>> >>> > This is still possible without reversing connection direction.  Note
>> >>> > that in zeromq there is *exactly zero* correlation between
>> >>> > communication direction and connection direction.  PUB can connect to
>> >>> > SUB, and vice versa.  In fact a single socket can bind and connect at
>> >>> > the same time.
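
A tiny pyzmq illustration of that point, for anyone following along: message
flow is PUB->SUB regardless of which side binds, and a single socket can bind
and connect at the same time. The ports here are arbitrary.

import time
import zmq

ctx = zmq.Context()

# the SUB side binds...
sub = ctx.socket(zmq.SUB)
sub.setsockopt(zmq.SUBSCRIBE, b"")
sub.bind("tcp://127.0.0.1:5555")

# ...and the PUB side connects to it; the same PUB socket can also bind.
pub = ctx.socket(zmq.PUB)
pub.connect("tcp://127.0.0.1:5555")
pub.bind("tcp://127.0.0.1:5556")

time.sleep(0.5)    # give the subscription a moment to propagate
pub.send(b"hello")
print(sub.recv())  # -> hello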
>> >>> >
>> >>> >
>> >>> > It may also be unnecessary, because if the controller comes up at the
>> >>> > same endpoint(s), then zeromq handles all the reconnects invisibly.  A
>> >>> > connection to an endpoint is always valid, whether or not there is a
>> >>> > socket present at any given point in time.
>> >>> >
>> >>> >
>> >>> >         On Sun, 2012-02-12 at 12:06 -0800, MinRK wrote:
>> >>> >         >
>> >>> >         > On Sun, Feb 12, 2012 at 11:48, Darren Govoni
>> >>> >         > <darren@ontrenet.com> wrote:
>> >>> >         >         On Sun, 2012-02-12 at 11:12 -0800, MinRK wrote:
>> >>> >         >         >
>> >>> >         >         > On Sun, Feb 12, 2012 at 10:42, Darren Govoni
>> >>> >         >         > <darren@ontrenet.com> wrote:
>> >>> >         >         >         Thanks Min,
>> >>> >         >         >
>> >>> >         >         >         Is it possible to open a ticket for this
>> >>> >         >         >         capability for a (near) future release?
>> >>> >         >         >         It complements that already amazing load
>> >>> >         >         >         balancing capability.
>> >>> >         >         >
>> >>> >         >         > You are welcome to open an Issue.  I don't know
>> >>> >         >         > if it will make it into one of the next few
>> >>> >         >         > releases, but it is on my todo list.  The best
>> >>> >         >         > way to get this sort of thing going is to start
>> >>> >         >         > with a Pull Request.
>> >>> >         >
>> >>> >         >         Ok, I will open an issue. Thanks. In the meantime,
>> >>> >         >         is it possible for clients to 'know' when a
>> >>> >         >         controller is no longer available? For example, it
>> >>> >         >         would be nice if I can insert a callback handler
>> >>> >         >         for this sort of internal exception so I can
>> >>> >         >         provide some graceful recovery options.
>> >>> >         >
>> >>> >         > It would be sensible to add a heartbeat mechanism on the
>> >>> >         > controller->client PUB channel for this information.
>> >>> >         > Until then, your main controller crash detection is going
>> >>> >         > to be simple timeouts.
>> >>> >         >
>> >>> >         > ZeroMQ makes disconnect detection a challenge (because
>> >>> >         > there are no disconnect events, because a disconnected
>> >>> >         > channel is still valid, as the peer is allowed to just
>> >>> >         > come back up).
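
Until such a heartbeat exists, I guess the crude timeout approach on our side
would look something like this. This is only a sketch; the 30-second
threshold is arbitrary and the recovery action is a placeholder of mine:

from IPython.parallel import Client

def ping():
    return "ok"

client = Client()
view = client.load_balanced_view()
ar = view.apply_async(ping)
try:
    print(ar.get(timeout=30))  # raises a timeout error if nothing comes back
except Exception:
    # no reply within 30s: assume the controller (or the path to it) is gone,
    # and trigger whatever recovery/relaunch logic we end up writing
    print("no reply in 30s -- controller may be down, starting recovery")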
>> >>> >         >
>> >>> >         >         >         Perhaps a related but separate notion
>> >>> >         >         >         would be the ability to have clustered
>> >>> >         >         >         controllers for HA.
>> >>> >         >         >
>> >>> >         >         > I do have a model in mind for this sort of
>> >>> >         >         > thing, though not multiple *controllers*, rather
>> >>> >         >         > multiple Schedulers.  Our design with 0MQ would
>> >>> >         >         > make this pretty simple (just start another
>> >>> >         >         > scheduler, and make an extra call to
>> >>> >         >         > socket.connect() on the Client and Engine is all
>> >>> >         >         > that's needed), and this should allow scaling to
>> >>> >         >         > tens of thousands of engines.
>> >>> >         >         >
>> >>> >         >         Yes! That's what I'm after. In this cloud-scale
>> >>> >         >         age of computing, that would be ideal.
>> >>> >         >
>> >>> >         >         Thanks Min.
>> >>> >         >
>> >>> >         >         >         On Sun, 2012-02-12 at 08:32 -0800,
>> >>> >         >         >         Min RK wrote:
>> >>> >         >         >         > No, there is no failover mechanism.
>> >>> >         >         >         > When the controller goes down, further
>> >>> >         >         >         > requests will simply hang.  We have
>> >>> >         >         >         > almost all the information we need to
>> >>> >         >         >         > bring up a new controller in its place
>> >>> >         >         >         > (restart it), in which case the Client
>> >>> >         >         >         > wouldn't even need to know that it
>> >>> >         >         >         > went down, and would continue to just
>> >>> >         >         >         > work, thanks to some zeromq magic.
>> >>> >         >         >         >
>> >>> >         >         >         > -MinRK
>> >>> >         >         >         >
>> >>> >         >         >         > On Feb 12, 2012, at 5:02, Darren
>> >>> >         >         >         > Govoni <darren@ontrenet.com> wrote:
>> >>> >         >         >         >
>> >>> >         >         >         > > Hi,
>> >>> >         >         >         > >  Does ipython support any kind of
>> >>> >         >         >         > > clustering or failover for
>> >>> >         >         >         > > ipcontrollers? I'm wondering how
>> >>> >         >         >         > > situations are handled where a
>> >>> >         >         >         > > controller goes down when a client
>> >>> >         >         >         > > needs to perform something.
>> >>> >         >         >         > >
>> >>> >         >         >         > > thanks for any tips.
>> >>> >         >         >         > > Darren