[IPython-User] ipcontroller failover?

MinRK benjaminrk@gmail....
Tue Mar 6 17:24:20 CST 2012


Sounds neat!

What do you expect/want to happen regarding tasks that are

a) waiting in the Scheduler
b) *finished* on engines
c) submitted by clients

while the controller is down?

-MinRK

On Tue, Mar 6, 2012 at 14:23, <darren@ontrenet.com> wrote:

> Sure.
>
> We're developing a cloud-based software system that processes
> documents/files etc. It is currently built around Amazon cloud APIs but we
> want to lose that dependency. So we have a single portal server that
> provides the user experience and acts as "controller" of the other
> servers. They can launch up to 100 virtual servers and then assign those
> servers to queues.
>
> Work messages are sent to the queues and each server fetches one message.
> Amazon has a queue service for this, but it's not as fast as ipython. It
> is, however, fault tolerant.
>
> We want to move that internally to ipython with its nice load balancing
> features. The portal server will house the controller and each possible
> server (up to 100) will have 1 or more engines connected to it.
>
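> As a rough sketch of how we picture using it (illustrative only; the
> process_document function and the file list are our placeholders, not part
> of IPython):
>
>     from IPython.parallel import Client
>
>     def process_document(path):
>         # stand-in for our real per-document processing
>         return len(open(path).read())
>
>     client = Client()                    # reads the controller's client file
>     view = client.load_balanced_view()   # let the scheduler pick an engine
>     ar = view.map_async(process_document, ['a.txt', 'b.txt', 'c.txt'])
>     results = ar.get()                   # block until the engines finish
>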
> One aspect of our design is that it must accommodate hardware failures.
> Currently any of the worker servers can just "disappear" without affecting
> the outcome. Likewise, new ones can emerge and help with the work.
>
> The portal server can also be rebooted or relaunched if necessary because
> all the cloud data is "in the cloud".
>
> Since our portal acts as the "head" of the system, if it runs ipcontroller
> and is rebooted (which is allowed), then all 100 servers get confused and
> won't know to reconnect. I can write some logic to force this, but it seems
> easier for ipcontroller to remember its own state. Glad to see it's a coming
> feature!
>
> > Can I ask more about what your environment is like, and the typical
> > circumstances of controller shutdown / crash?
> >
> > How often does the controller die, how many tasks are pending in the
> > Schedulers, and how many are active on engines when this happens?  What
> > are your expectations/hopes/dreams for behavior if the controller goes
> > down while a bunch of work is in-flight?
> >
> > -MinRK
> >
> > On Tue, Mar 6, 2012 at 13:20, <darren@ontrenet.com> wrote:
> >
> >> Wow. Awesome. Let me try it. Many thanks.
> >>
> >> > You might check out this first-go implementation:
> >> >
> >> > https://github.com/ipython/ipython/pull/1471
> >> >
> >> > It seems to work fine if the cluster was idle at controller crash, but
> >> > I haven't tested the behavior of running jobs.  I'm certain that the
> >> > propagation of results of jobs submitted before shutdown all the way up
> >> > to interactive Clients is broken, but the results should still arrive
> >> > in the Hub's db.
> >> >
> >> > -MinRK
> >> >
> >> >
> >> > On Mon, Mar 5, 2012 at 16:38, MinRK <benjaminrk@gmail.com> wrote:
> >> >
> >> >> Correct, engines do not reconnect to a new controller, and right now
> >> >> a Controller is a single point of failure.
> >> >>
> >> >> We absolutely do intend to enable restarting the controller, and it
> >> >> wouldn't be remotely difficult; the code just isn't written yet.
> >> >>
> >> >> Steps required for this:
> >> >>
> >> >> 1. persist engine connection state to files/db (the engine ID/UUID
> >> >>    mapping)
> >> >> 2. when starting up, load this information into the Hub, instead of
> >> >>    starting from scratch
> >> >>
> >> >> That is all.  No change should be required in the engines or clients,
> >> >> as zeromq handles the reconnect automagically.
> >> >>
> >> >> There is already enough information stored in the *task* database to
> >> >> resume all tasks that were waiting in the Scheduler, but I'm not sure
> >> >> whether this should be done by default, or only on request.
> >> >>
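> >> >> To make step 1 concrete, here is a minimal sketch of the idea (purely
> >> >> illustrative and untested; the file name and dict layout are made up,
> >> >> not what the Hub would actually use):
> >> >>
> >> >>     import json
> >> >>
> >> >>     def save_engine_state(engines, path='engine_state.json'):
> >> >>         # engines: the {id: uuid} mapping the Hub holds for
> >> >>         # registered engines; dump it whenever it changes.
> >> >>         with open(path, 'w') as f:
> >> >>             json.dump(engines, f)
> >> >>
> >> >>     def load_engine_state(path='engine_state.json'):
> >> >>         # On startup, seed the new Hub with the saved mapping instead
> >> >>         # of starting from scratch (JSON turns the integer ids into
> >> >>         # strings, so convert them back).
> >> >>         with open(path) as f:
> >> >>             return dict((int(k), v) for k, v in json.load(f).items())
> >> >>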
> >> >> -MinRK
> >> >>
> >> >> On Mon, Mar 5, 2012 at 15:17, Darren Govoni <darren@ontrenet.com> wrote:
> >> >>
> >> >>> Hi,
> >> >>>
> >> >>> On Sun, 2012-02-12 at 13:19 -0800, MinRK wrote:
> >> >>> > It may also be unnecessary, because if the controller comes up at
> >> >>> > the same endpoint(s), then zeromq handles all the reconnects
> >> >>> > invisibly.  A connection to an endpoint is always valid, whether or
> >> >>> > not there is a socket present at any given point in time.
> >> >>>
> >> >>>   I tried an example to see this. I ran an ipcontroller on one
> >> >>> machine with a static --port=21001 so the engine/client connection
> >> >>> files would always be valid.
> >> >>>
> >> >>
> >> >> Just specifying the registration port isn't enough information, and
> >> >> you should be using `--reuse` or `IPControllerApp.reuse_files=True`
> >> >> for connection files to remain valid across sessions.
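> >> >>
> >> >> For example, either on the command line (keeping your static port):
> >> >>
> >> >>     ipcontroller --reuse --port=21001
> >> >>
> >> >> or the equivalent in ipcontroller_config.py:
> >> >>
> >> >>     c = get_config()
> >> >>     c.IPControllerApp.reuse_files = True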
> >> >>
> >> >>
> >> >>>
> >> >>> I connected one engine from another server.
> >> >>>
> >> >>> I killed the controller and restarted it.
> >> >>>
> >> >>> After doing:
> >> >>>
> >> >>> client = Client()
> >> >>> client.ids
> >> >>> []
> >> >>>
> >> >>> There are no longer any engines connected.
> >> >>>
> >> >>> dview = client[:]
> >> >>> ...
> >> >>> NoEnginesRegistered: Can't build targets without any engines
> >> >>>
> >> >>> The problem perhaps is that for any large-scale system, say 1
> >> >>> controller with 50 engines running on 50 servers, this single point
> >> >>> of failure is hard to remedy.
> >> >>>
> >> >>> Is there a way to tell the controller to reconnect to the last known
> >> >>> engine IP addresses? Or some other way to re-establish the grid?
> >> >>> Rebooting 50 servers is not a good option for us.
> >> >>>
> >> >>> On Sun, 2012-02-12 at 13:19 -0800, MinRK wrote:
> >> >>> >
> >> >>> >
> >> >>> > On Sun, Feb 12, 2012 at 13:02, Darren Govoni <darren@ontrenet.com> wrote:
> >> >>> >
> >> >>> >         Correct me if I'm wrong, but do the ipengines 'connect' or
> >> >>> >         otherwise announce their presence to the controller?
> >> >>> >
> >> >>> > Yes, 100% of the connections are inbound to the controller
> >> >>> > processes, from clients and engines alike.  This is a strict
> >> >>> > requirement, because it would not be acceptable for engines to need
> >> >>> > open ports for inbound connections.  Simply bringing up a new
> >> >>> > controller with the same connection information would result in the
> >> >>> > cluster continuing to function, with the engines and client never
> >> >>> > realizing the controller went down at all, nor having to act on it
> >> >>> > in any way.
> >> >>> >
> >> >>> >         If it were the other way around, then this would accommodate
> >> >>> >         some degree of fault tolerance for the controller, because
> >> >>> >         it could be restarted by a watchdog and then re-establish
> >> >>> >         the connected state of the cluster. i.e. a controller comes
> >> >>> >         online, a pub/sub message is sent to a known channel, and
> >> >>> >         clients or engines add the new ipcontroller to their
> >> >>> >         internal list as a failover endpoint.
> >> >>> >
> >> >>> > This is still possible without reversing connection direction.  Note
> >> >>> > that in zeromq there is *exactly zero* correlation between
> >> >>> > communication direction and connection direction.  PUB can connect
> >> >>> > to SUB, and vice versa.  In fact a single socket can bind and
> >> >>> > connect at the same time.
> >> >>> >
> >> >>> > It may also be unnecessary, because if the controller comes up at
> >> >>> > the same endpoint(s), then zeromq handles all the reconnects
> >> >>> > invisibly.  A connection to an endpoint is always valid, whether or
> >> >>> > not there is a socket present at any given point in time.
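> >> >>> >
> >> >>> > As a tiny pyzmq illustration of that point (just a sketch; the
> >> >>> > address is arbitrary):
> >> >>> >
> >> >>> >     import time
> >> >>> >     import zmq
> >> >>> >
> >> >>> >     ctx = zmq.Context()
> >> >>> >
> >> >>> >     # The subscriber *binds*...
> >> >>> >     sub = ctx.socket(zmq.SUB)
> >> >>> >     sub.setsockopt(zmq.SUBSCRIBE, b'')
> >> >>> >     sub.bind('tcp://127.0.0.1:5555')
> >> >>> >
> >> >>> >     # ...and the publisher *connects*: data direction and
> >> >>> >     # connection direction are completely independent.
> >> >>> >     pub = ctx.socket(zmq.PUB)
> >> >>> >     pub.connect('tcp://127.0.0.1:5555')
> >> >>> >
> >> >>> >     time.sleep(0.1)       # give the connection a moment to settle
> >> >>> >     pub.send(b'hello')
> >> >>> >     print(sub.recv())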
> >> >>> >
> >> >>> >
> >> >>> >         On Sun, 2012-02-12 at 12:06 -0800, MinRK wrote:
> >> >>> >         >
> >> >>> >         > On Sun, Feb 12, 2012 at 11:48, Darren Govoni <darren@ontrenet.com> wrote:
> >> >>> >         >         On Sun, 2012-02-12 at 11:12 -0800, MinRK wrote:
> >> >>> >         >         >
> >> >>> >         >         > On Sun, Feb 12, 2012 at 10:42, Darren Govoni <darren@ontrenet.com> wrote:
> >> >>> >         >         >         Thanks Min,
> >> >>> >         >         >
> >> >>> >         >         >         Is it possible to open a ticket for this
> >> >>> >         >         >         capability for a (near) future release?
> >> >>> >         >         >         It complements that already amazing load
> >> >>> >         >         >         balancing capability.
> >> >>> >         >         >
> >> >>> >         >         > You are welcome to open an Issue.  I don't know
> >> >>> >         >         > if it will make it into one of the next few
> >> >>> >         >         > releases, but it is on my todo list.  The best
> >> >>> >         >         > way to get this sort of thing going is to start
> >> >>> >         >         > with a Pull Request.
> >> >>> >         >
> >> >>> >         >         Ok, I will open an issue. Thanks. In the meantime,
> >> >>> >         >         is it possible for clients to 'know' when a
> >> >>> >         >         controller is no longer available? For example, it
> >> >>> >         >         would be nice if I could insert a callback handler
> >> >>> >         >         for this sort of internal exception so I can
> >> >>> >         >         provide some graceful recovery options.
> >> >>> >         >
> >> >>> >         > It would be sensible to add a heartbeat mechanism on the
> >> >>> >         > controller->client PUB channel for this information.
> >> >>> >         > Until then, your main controller crash detection is going
> >> >>> >         > to be simple timeouts.
> >> >>> >         >
> >> >>> >         > ZeroMQ makes disconnect detection a challenge (because
> >> >>> >         > there are no disconnect events, because a disconnected
> >> >>> >         > channel is still valid, as the peer is allowed to just
> >> >>> >         > come back up).
> >> >>> >         >
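> >> >>> >         > As a rough illustration of the timeout approach (a sketch
> >> >>> >         > only, using the public Client API; the 5-second budget is
> >> >>> >         > arbitrary):
> >> >>> >         >
> >> >>> >         >     from IPython.parallel import Client
> >> >>> >         >
> >> >>> >         >     client = Client()
> >> >>> >         >     view = client.load_balanced_view()
> >> >>> >         >
> >> >>> >         >     def ping():
> >> >>> >         >         return 'pong'
> >> >>> >         >
> >> >>> >         >     # Submit a trivial task; if nothing comes back within
> >> >>> >         >     # the timeout, treat the controller as unreachable
> >> >>> >         >     # and kick off whatever recovery you need.
> >> >>> >         >     ar = view.apply_async(ping)
> >> >>> >         >     if not client.wait([ar], timeout=5):
> >> >>> >         >         print('controller appears to be down')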
> >> >>> >         >
> >> >>> >         >         >         Perhaps a related but separate notion
> >> >>> >         >         >         would be the ability to have clustered
> >> >>> >         >         >         controllers for HA.
> >> >>> >         >         >
> >> >>> >         >         > I do have a model in mind for this sort of
> >> >>> >         >         > thing, though not multiple *controllers*, rather
> >> >>> >         >         > multiple Schedulers.  Our design with 0MQ would
> >> >>> >         >         > make this pretty simple (just start another
> >> >>> >         >         > scheduler, and an extra call to socket.connect()
> >> >>> >         >         > on the Client and Engine is all that's needed),
> >> >>> >         >         > and this should allow scaling to tens of
> >> >>> >         >         > thousands of engines.
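> >> >>> >         >         >
> >> >>> >         >         > Schematically (pure pyzmq, not the actual Client
> >> >>> >         >         > code; the two addresses just stand in for two
> >> >>> >         >         > scheduler endpoints):
> >> >>> >         >         >
> >> >>> >         >         >     import zmq
> >> >>> >         >         >
> >> >>> >         >         >     ctx = zmq.Context()
> >> >>> >         >         >     task = ctx.socket(zmq.DEALER)
> >> >>> >         >         >
> >> >>> >         >         >     # one socket, two schedulers: zeromq
> >> >>> >         >         >     # spreads requests over both endpoints
> >> >>> >         >         >     task.connect('tcp://scheduler-a:5570')
> >> >>> >         >         >     task.connect('tcp://scheduler-b:5570')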
> >> >>> >         >
> >> >>> >         >         Yes! That's what I'm after. In this cloud-scale
> >> >>> >         >         age of computing, that would be ideal.
> >> >>> >         >
> >> >>> >         >         Thanks Min.
> >> >>> >         >
> >> >>> >         >         >         On Sun, 2012-02-12 at 08:32 -0800, Min RK wrote:
> >> >>> >         >         >         > No, there is no failover mechanism.
> >> >>> >         >         >         > When the controller goes down, further
> >> >>> >         >         >         > requests will simply hang.  We have
> >> >>> >         >         >         > almost all the information we need to
> >> >>> >         >         >         > bring up a new controller in its place
> >> >>> >         >         >         > (restart it), in which case the Client
> >> >>> >         >         >         > wouldn't even need to know that it went
> >> >>> >         >         >         > down, and would continue to just work,
> >> >>> >         >         >         > thanks to some zeromq magic.
> >> >>> >         >         >         >
> >> >>> >         >         >         > -MinRK
> >> >>> >         >         >         >
> >> >>> >         >         >         > On Feb 12, 2012, at 5:02, Darren Govoni <darren@ontrenet.com> wrote:
> >> >>> >         >         >         >
> >> >>> >         >         >         > > Hi,
> >> >>> >         >         >         > >  Does ipython support any kind of
> >> >>> >         >         >         > > clustering or failover for
> >> >>> >         >         >         > > ipcontrollers? I'm wondering how
> >> >>> >         >         >         > > situations are handled where a
> >> >>> >         >         >         > > controller goes down when a client
> >> >>> >         >         >         > > needs to perform something.
> >> >>> >         >         >         > >
> >> >>> >         >         >         > > thanks for any tips.
> >> >>> >         >         >         > > Darren