[IPython-User] ipcontroller failover?

darren darren@ontrenet....
Tue Mar 6 17:54:30 CST 2012


I am still very new to IPython and lack some of the deeper knowledge to
answer this directly, but I'll try to answer it this way.

In similar queue servers I've written, I tried to keep all the necessary
state offline (in MongoDB, in my case) so that a restarted server can
recover its state. Sometimes that means pending messages are re-queued
from offline storage because they were never ack'd while the server was
up. When that happens, the engine/client side should simply forget about
the task, knowing the controller will re-queue it and acquire an ack
later. If it hangs onto un-ack'd tasks, they can add up and become a
load problem for that engine/client. It should keep its plate as clean
as possible and not hold state for long, if at all.
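
To make that concrete, here is a rough sketch of the requeue-on-restart
idea. The collection and field names ("tasks", "status", "acked") are
made up for illustration, not anything from IPython's own task database;
it's just the pattern I mean (newer pymongo style):

    # Sketch only: made-up collection/field names, not IPython internals.
    from pymongo import MongoClient

    def requeue_unacked(uri="mongodb://localhost:27017"):
        tasks = MongoClient(uri).workqueue.tasks
        # Anything that was handed out but never ack'd goes back to
        # "pending", so a freshly restarted server hands it out again.
        result = tasks.update_many(
            {"status": "dispatched", "acked": False},
            {"$set": {"status": "pending"}},
        )
        return result.modified_count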

This potential redundancy in processing is an "ok" tradeoff (for my
project at least) because (a) it is rare and (b) it's better than
letting something slip through the cracks.
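
For reference, the restart path Min mentions further down (the --reuse
flag, or IPControllerApp.reuse_files=True) would presumably look
something like this on our portal server; just a sketch of the config,
not something I've exercised at scale yet:

    # Sketch: ipcontroller_config.py on the portal server. Equivalent to
    # passing --reuse on the command line; we'd still pin the registration
    # port with --port=21001 as in the experiment further down.
    c = get_config()

    # Reuse the generated ipcontroller-engine.json / ipcontroller-client.json
    # across restarts instead of regenerating keys and ports every time.
    c.IPControllerApp.reuse_files = True

With that in place, the engines on the worker servers should just
reconnect when the controller comes back on the same endpoints, per the
zeromq behavior described below.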

Hopefully, I made a bit of sense...

On Tue, 2012-03-06 at 15:24 -0800, MinRK wrote:
> Sounds neat!
> 
> 
> What do you expect/want to happen regarding tasks that are 
> 
> 
> a) waiting in the Scheduler
> b) *finished* on engines
> c) submitted by clients
> 
> 
> while the controller is down?
> 
> 
> -MinRK
> 
> On Tue, Mar 6, 2012 at 14:23, <darren@ontrenet.com> wrote:
>         Sure.
>         
>         We're developing a cloud-based software system that processes
>         documents/files etc. It is currently built around Amazon cloud
>         APIs but we
>         want to lose that dependency. So we have a single portal
>         server that
>         provides the user experience and acts as "controller" of the
>         other
>         servers. They can launch up to 100 virtual servers and then
>         assign those
>         servers to queues.
>         
>         Work messages are sent to the queues and each server fetches
>         one message.
>         Amazon has a queue service for this, but it's not as fast as
>         IPython. It is, however, fault tolerant.
>         
>         We want to move that internally to ipython with its nice load
>         balancing
>         features. The portal server will house the controller and each
>         possible
>         server (up to 100) will have 1 or more engines connected to
>         it.
>         
>         One aspect of our design is that it must accommodate hardware
>         failures.
>         Currently any of the worker servers can just "disappear"
>         without affecting
>         the outcome. Likewise, new ones can emerge and help with the
>         work.
>         
>         The portal server can also be rebooted or relaunched if
>         necessary because
>         all the cloud data is "in the cloud".
>         
>         Since our portal acts as the "head" of the system, if it runs
>         ipcontroller
>         and is rebooted (which is allowed), then all 100 servers get
>         confused and
>         won't know to reconnect. I can write some logic to force this,
>         but it seems easier for ipcontroller to remember this itself.
>         Glad to see it's a coming feature!
>         
>         > Can I ask more about what your environment is like, and the
>         typical
>         > circumstances of controller shutdown / crash?
>         >
>         > How often does the controller die, how many tasks are
>         pending in the
>         > Schedulers, and how many are active on engines when this
>         happens?  What
>         > are
>         > your expectations/hopes/dreams for behavior if the
>         controller goes down
>         > while a bunch of work is in-flight?
>         >
>         > -MinRK
>         >
>         > On Tue, Mar 6, 2012 at 13:20, <darren@ontrenet.com> wrote:
>         >
>         >> Wow. Awesome. Let me try it. Many thanks.
>         >>
>         >> > You might check out this first-go implementation:
>         >> >
>         >> > https://github.com/ipython/ipython/pull/1471
>         >> >
>         >> > It seems to work fine if the cluster was idle at
>         controller crash, but
>         >> I
>         >> > haven't tested the behavior of running jobs.  I'm certain
>         that the
>         >> > propagation of results of jobs submitted before shutdown
>         all the way
>         >> up
>         >> to
>         >> > interactive Clients is broken, but the results should
>         still arrive in
>         >> the
>         >> > Hub's db.
>         >> >
>         >> > -MinRK
>         >> >
>         >> >
>         >> > On Mon, Mar 5, 2012 at 16:38, MinRK
>         <benjaminrk@gmail.com> wrote:
>         >> >
>         >> >> Correct, engines do not reconnect to a new controller,
>         and right now
>         >> a
>         >> >> Controller is a single point of failure.
>         >> >>
>         >> >> We absolutely do intend to enable restarting the
>         controller, and it
>         >> >> wouldn't be remotely difficult, the code just isn't
>         written yet.
>         >> >>
>         >> >> Steps required for this:
>         >> >>
>         >> >> 1. persist engine connection state to files/db (the
>         >> >> engine ID/UUID mapping should suffice)
>         >> >> 2. when starting up, load this information into the Hub,
>         instead of
>         >> >> starting from scratch
>         >> >>
>         >> >> That is all.  No change should be required in the
>         engines or clients,
>         >> as
>         >> >> zeromq handles the reconnect automagically.
>         >> >>
>         >> >> There is already enough information stored in the *task*
>         database to
>         >> >> resume all tasks that were waiting in the Scheduler, but
>         I'm not sure
>         >> >> whether this should be done by default, or only on
>         request.
>         >> >>
>         >> >> -MinRK
>         >> >>
>         >> >> On Mon, Mar 5, 2012 at 15:17, Darren Govoni
>         <darren@ontrenet.com>
>         >> wrote:
>         >> >>
>         >> >>> Hi,
>         >> >>>
>         >> >>> On Sun, 2012-02-12 at 13:19 -0800, MinRK wrote:
>         >> >>> > It may also be unnecessary, because if the controller
>         comes up at
>         >> the
>         >> >>> > same endpoint(s), then zeromq handles all the
>         reconnects
>         >> invisibly.
>         >> >>> A
>         >> >>> > connection to an endpoint is always valid, whether or
>         not there is
>         >> a
>         >> >>> > socket present at any given point in time.
>         >> >>>
>         >> >>>   I tried an example to see this. I ran an ipcontroller
>         on one
>         >> machine
>         >> >>> with static --port=21001 so engine client files would
>         always be
>         >> valid.
>         >> >>>
>         >> >>
>         >> >> Just specifying the registration port isn't enough
>         information, and
>         >> you
>         >> >> should be using `--reuse` or
>         `IPControllerApp.reuse_files=True` for
>         >> >> connection files to remain valid across sessions.
>         >> >>
>         >> >>
>         >> >>>
>         >> >>> I connected one engine from another server.
>         >> >>>
>         >> >>> I killed the controller and restarted it.
>         >> >>>
>         >> >>> After doing:
>         >> >>>
>         >> >>> client = Client()
>         >> >>> client.ids
>         >> >>> []
>         >> >>>
>         >> >>> There are no longer any engines connected.
>         >> >>>
>         >> >>> dview = client[:]
>         >> >>> ...
>         >> >>> NoEnginesRegistered: Can't build targets without any
>         engines
>         >> >>>
>         >> >>> The problem perhaps is that for any large scale system,
>         say 1
>         >> >>> controller
>         >> >>> with 50 engines running on 50 servers, this
>         single-point-of-failure
>         >> is
>         >> >>> hard to remedy.
>         >> >>>
>         >> >>> Is there a way to tell the controller to reconnect to
>         last known
>         >> engine
>         >> >>> IP addresses? Or some other way to re-establish the
>         grid? Rebooting
>         >> 50
>         >> >>> servers is not a good option for us.
>         >> >>>
>         >> >>> On Sun, 2012-02-12 at 13:19 -0800, MinRK wrote:
>         >> >>> >
>         >> >>> >
>         >> >>> > On Sun, Feb 12, 2012 at 13:02, Darren Govoni
>         <darren@ontrenet.com>
>         >> >>> > wrote:
>         >> >>> >         Correct me if I'm wrong, but do the ipengines
>         'connect' or
>         >> >>> >         otherwise
>         >> >>> >         announce their presence to the controller?
>         >> >>> >
>         >> >>> >
>         >> >>> > Yes, 100% of the connections are inbound to the
>         controller
>         >> processes,
>         >> >>> > from clients and engines alike.  This is a strict
>         requirement,
>         >> >>> because
>         >> >>> > it would not be acceptable for engines to need open
>         ports for
>         >> inbound
>         >> >>> > connections.  Simply bringing up a new controller
>         with the same
>         >> >>> > connection information would result in the cluster
>         continuing to
>         >> >>> > function, with the engines and client never realizing
>         the
>         >> controller
>         >> >>> > went down at all, nor having to act on it in any way.
>         >> >>> >
>         >> >>> >         If it were the other way
>         >> >>> >         around, then this would accommodate some
>         degree of fault
>         >> >>> >         tolerance for
>         >> >>> >         the controller, because it could be restarted
>         >> >>> >         by a watchdog and then
>         >> >>> >         re-establish the connected state of the
>         cluster. i.e. a
>         >> >>> >         controller comes
>         >> >>> >         online. a pub/sub message is sent to a known
>         channel and
>         >> >>> >         clients or
>         >> >>> >         engines add the new ipcontroller to its
>         internal list as a
>         >> >>> >         failover
>         >> >>> >         endpoint.
>         >> >>> >
>         >> >>> >
>         >> >>> > This is still possible without reversing connection
>         direction.
>         >> Note
>         >> >>> > that in zeromq there is *exactly zero* correlation
>         between
>         >> >>> > communication direction and connection direction.
>          PUB can connect
>         >> to
>         >> >>> > SUB, and vice versa.  In fact a single socket can
>         bind and connect
>         >> at
>         >> >>> > the same time.
>         >> >>> >
>         >> >>> >
>         >> >>> > It may also be unnecessary, because if the controller
>         comes up at
>         >> the
>         >> >>> > same endpoint(s), then zeromq handles all the
>         reconnects
>         >> invisibly.
>         >> >>> A
>         >> >>> > connection to an endpoint is always valid, whether or
>         not there is
>         >> a
>         >> >>> > socket present at any given point in time.
>         >> >>> >
>         >> >>> >
>         >> >>> >         On Sun, 2012-02-12 at 12:06 -0800, MinRK
>         wrote:
>         >> >>> >         >
>         >> >>> >         >
>         >> >>> >         > On Sun, Feb 12, 2012 at 11:48, Darren
>         Govoni
>         >> >>> >         <darren@ontrenet.com>
>         >> >>> >         > wrote:
>         >> >>> >         >         On Sun, 2012-02-12 at 11:12 -0800,
>         MinRK wrote:
>         >> >>> >         >         >
>         >> >>> >         >         >
>         >> >>> >         >         > On Sun, Feb 12, 2012 at 10:42,
>         Darren Govoni
>         >> >>> >         >         <darren@ontrenet.com>
>         >> >>> >         >         > wrote:
>         >> >>> >         >         >         Thanks Min,
>         >> >>> >         >         >
>         >> >>> >         >         >         Is it possible to open a
>         ticket for
>         >> this
>         >> >>> >         capability
>         >> >>> >         >         for a
>         >> >>> >         >         >         (near) future
>         >> >>> >         >         >         release? It complements
>         >> >>> >         the already
>         >> >>> >         amazing load
>         >> >>> >         >         balancing
>         >> >>> >         >         >         capability.
>         >> >>> >         >         >
>         >> >>> >         >         >
>         >> >>> >         >         > You are welcome to open an
>         Issue.  I don't
>         >> know
>         >> >>> if
>         >> >>> >         it will
>         >> >>> >         >         make it
>         >> >>> >         >         > into one of the next few
>         releases, but it is
>         >> on
>         >> >>> my
>         >> >>> >         todo
>         >> >>> >         >         list.  The
>         >> >>> >         >         > best way to get this sort of
>         thing going is to
>         >> >>> >         start with a
>         >> >>> >         >         Pull
>         >> >>> >         >         > Request.
>         >> >>> >         >
>         >> >>> >         >
>         >> >>> >         >         Ok, I will open an issue. Thanks.
>         In the
>         >> meantime,
>         >> >>> >         is it
>         >> >>> >         >         possible for
>         >> >>> >         >         clients to 'know' when a controller
>         is no longer
>         >> >>> >         available?
>         >> >>> >         >         For example,
>         >> >>> >         >         it would be nice if I can insert a
>         callback
>         >> handler
>         >> >>> >         for this
>         >> >>> >         >         sort of
>         >> >>> >         >         internal exception so I can provide
>         some
>         >> graceful
>         >> >>> >         recovery
>         >> >>> >         >         options.
>         >> >>> >         >
>         >> >>> >         >
>         >> >>> >         > It would be sensible to add a heartbeat
>         mechanism on the
>         >> >>> >         > controller->client PUB channel for this
>         information.
>         >> Until
>         >> >>> >         then, your
>         >> >>> >         > main controller crash detection is going to
>         be simple
>         >> >>> >         timeouts.
>         >> >>> >         >
>         >> >>> >         >
>         >> >>> >         > ZeroMQ makes disconnect detection a
>         challenge (because
>         >> >>> there
>         >> >>> >         are no
>         >> >>> >         > disconnect events, because a disconnected
>         channel is
>         >> still
>         >> >>> >         valid, as
>         >> >>> >         > the peer is allowed to just come back up).
>         >> >>> >         >
>         >> >>> >         >
>         >> >>> >         >         >
>         >> >>> >         >         >
>         >> >>> >         >         >         Perhaps a related but
>         separate notion
>         >> >>> >         would be the
>         >> >>> >         >         ability to
>         >> >>> >         >         >         have
>         >> >>> >         >         >         clustered controllers for
>         HA.
>         >> >>> >         >         >
>         >> >>> >         >         >
>         >> >>> >         >         > I do have a model in mind for
>         this sort of
>         >> thing,
>         >> >>> >         though not
>         >> >>> >         >         multiple
>         >> >>> >         >         > *controllers*, rather multiple
>         Schedulers.
>         >> Our
>         >> >>> >         design with
>         >> >>> >         >         0MQ would
>         >> >>> >         >         > make this pretty simple (just
>         start another
>         >> >>> >         scheduler, and
>         >> >>> >         >         make an
>         >> >>> >         >         > extra call to socket.connect() on
>         the Client
>         >> and
>         >> >>> >         Engine is
>         >> >>> >         >         all that's
>         >> >>> >         >         > needed), and this should allow
>         scaling to tens
>         >> of
>         >> >>> >         thousands
>         >> >>> >         >         of
>         >> >>> >         >         > engines.
>         >> >>> >         >
>         >> >>> >         >
>         >> >>> >         >         Yes! That's what I'm after. In this
>         cloud-scale
>         >> age
>         >> >>> >         of
>         >> >>> >         >         computing, that
>         >> >>> >         >         would be ideal.
>         >> >>> >         >
>         >> >>> >         >
>         >> >>> >         >         Thanks Min.
>         >> >>> >         >
>         >> >>> >         >         >
>         >> >>> >         >         >
>         >> >>> >         >         >         On Sun, 2012-02-12 at
>         08:32 -0800, Min
>         >> RK
>         >> >>> >         wrote:
>         >> >>> >         >         >         > No, there is no
>         failover mechanism.
>         >> >>> >          When the
>         >> >>> >         >         controller
>         >> >>> >         >         >         goes down, further
>         requests will
>         >> simply
>         >> >>> >         hang.  We
>         >> >>> >         >         have almost
>         >> >>> >         >         >         all the information we
>         need to bring
>         >> up a
>         >> >>> >         new
>         >> >>> >         >         controller in
>         >> >>> >         >         >         its place (restart it),
>         in which case
>         >> the
>         >> >>> >         Client
>         >> >>> >         >         wouldn't even
>         >> >>> >         >         >         need to know that it went
>         down, and
>         >> would
>         >> >>> >         continue
>         >> >>> >         >         to just
>         >> >>> >         >         >         work, thanks to some
>         zeromq magic.
>         >> >>> >         >         >         >
>         >> >>> >         >         >         > -MinRK
>         >> >>> >         >         >         >
>         >> >>> >         >         >         > On Feb 12, 2012, at
>         5:02, Darren
>         >> Govoni
>         >> >>> >         >         >         <darren@ontrenet.com>
>         wrote:
>         >> >>> >         >         >         >
>         >> >>> >         >         >         > > Hi,
>         >> >>> >         >         >         > >  Does ipython support
>         any kind of
>         >> >>> >         clustering or
>         >> >>> >         >         failover
>         >> >>> >         >         >         for
>         >> >>> >         >         >         > > ipcontrollers? I'm
>         wondering how
>         >> >>> >         situations are
>         >> >>> >         >         handled
>         >> >>> >         >         >         where a
>         >> >>> >         >         >         > > controller goes down
>         when a client
>         >> >>> >         needs to
>         >> >>> >         >         perform
>         >> >>> >         >         >         something.
>         >> >>> >         >         >         > >
>         >> >>> >         >         >         > > thanks for any tips.
>         >> >>> >         >         >         > > Darren
>         >> >>> >         >         >         > >
> 
> 
> _______________________________________________
> IPython-User mailing list
> IPython-User@scipy.org
> http://mail.scipy.org/mailman/listinfo/ipython-user



