[IPython-User] ipcontroller failover?

Darren Govoni darren@ontrenet.com
Tue Mar 6 15:20:48 CST 2012


Wow. Awesome. Let me try it. Many thanks.

> You might check out this first-go implementation:
>
> https://github.com/ipython/ipython/pull/1471
>
> It seems to work fine if the cluster was idle when the controller crashed,
> but I haven't tested the behavior with jobs still running.  I'm certain that
> the propagation of results for jobs submitted before shutdown all the way up
> to interactive Clients is broken, but the results should still arrive in the
> Hub's db.
>
> -MinRK
>
>
> On Mon, Mar 5, 2012 at 16:38, MinRK <benjaminrk@gmail.com> wrote:
>
>> Correct, engines do not reconnect to a new controller, and right now a
>> Controller is a single point of failure.
>>
>> We absolutely do intend to enable restarting the controller, and it
>> wouldn't be remotely difficult; the code just isn't written yet.
>>
>> Steps required for this:
>>
>> 1. persist engine connection state to files/db (the engine ID/UUID mapping
>> should suffice); see the sketch below
>> 2. when starting up, load this information into the Hub, instead of
>> starting from scratch
>>
>> That is all.  No change should be required in the engines or clients, as
>> zeromq handles the reconnect automagically.
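>>
>> As a rough sketch of those two steps (hypothetical code: the file name and
>> layout are illustrative, not the actual Hub internals):
>>
>>     import json
>>
>>     STATE_FILE = 'engine_state.json'
>>
>>     def save_engine_state(engines, path=STATE_FILE):
>>         """engines: dict mapping integer engine ID -> zmq identity/UUID."""
>>         with open(path, 'w') as f:
>>             json.dump(engines, f)
>>
>>     def load_engine_state(path=STATE_FILE):
>>         """Return the saved mapping, or {} on a fresh start."""
>>         try:
>>             with open(path) as f:
>>                 return dict((int(k), v) for k, v in json.load(f).items())
>>         except IOError:
>>             return {}
>>
>>     # step 1: call save_engine_state() whenever an engine (un)registers
>>     # step 2: at Hub startup, seed the registry from load_engine_state()
>>     #         instead of starting from scratch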
>>
>> There is already enough information stored in the *task* database to
>> resume all tasks that were waiting in the Scheduler, but I'm not sure
>> whether this should be done by default, or only on request.
>>
>> -MinRK
>>
>> On Mon, Mar 5, 2012 at 15:17, Darren Govoni <darren@ontrenet.com> wrote:
>>
>>> Hi,
>>>
>>> On Sun, 2012-02-12 at 13:19 -0800, MinRK wrote:
>>> > It may also be unnecessary, because if the controller comes up at the
>>> > same endpoint(s), then zeromq handles all the reconnects invisibly.  A
>>> > connection to an endpoint is always valid, whether or not there is a
>>> > socket present at any given point in time.
>>>
>>>   I tried an example to see this. I ran an ipcontroller on one machine
>>> with a static --port=21001 so the engine and client connection files would
>>> always be valid.
>>>
>>
>> Just specifying the registration port isn't enough information, and you
>> should be using `--reuse` or `IPControllerApp.reuse_files=True` for
>> connection files to remain valid across sessions.
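>>
>> For example, something like this on the command line (the port is just the
>> one from your example):
>>
>>     ipcontroller --reuse --port=21001
>>
>> or persistently in ipcontroller_config.py:
>>
>>     c = get_config()
>>     c.IPControllerApp.reuse_files = True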
>>
>>
>>>
>>> I connected one engine from another server.
>>>
>>> I killed the controller and restarted it.
>>>
>>> After doing:
>>>
>>> from IPython.parallel import Client
>>> client = Client()
>>> client.ids
>>> []
>>>
>>> There are no longer any engines connected.
>>>
>>> dview = client[:]
>>> ...
>>> NoEnginesRegistered: Can't build targets without any engines
>>>
>>> The problem, perhaps, is that for any large-scale system, say 1 controller
>>> with 50 engines running on 50 servers, this single point of failure is
>>> hard to remedy.
>>>
>>> Is there a way to tell the controller to reconnect to last known engine
>>> IP addresses? Or some other way to re-establish the grid? Rebooting 50
>>> servers is not a good option for us.
>>>
>>> On Sun, 2012-02-12 at 13:19 -0800, MinRK wrote:
>>> >
>>> >
>>> > On Sun, Feb 12, 2012 at 13:02, Darren Govoni <darren@ontrenet.com>
>>> > wrote:
>>> >         Correct me if I'm wrong, but do the ipengines 'connect' or
>>> >         otherwise announce their presence to the controller?
>>> >
>>> >
>>> > Yes, 100% of the connections are inbound to the controller processes,
>>> > from clients and engines alike.  This is a strict requirement, because
>>> > it would not be acceptable for engines to need open ports for inbound
>>> > connections.  Simply bringing up a new controller with the same
>>> > connection information would result in the cluster continuing to
>>> > function, with the engines and client never realizing the controller
>>> > went down at all, nor having to act on it in any way.
>>> >
>>> >         If it were the other way around, then this would accommodate
>>> >         some degree of fault tolerance for the controller, because it
>>> >         could be restarted by a watchdog which would then re-establish
>>> >         the connected state of the cluster: i.e. a controller comes
>>> >         online, a pub/sub message is sent to a known channel, and
>>> >         clients or engines add the new ipcontroller to their internal
>>> >         list as a failover endpoint.
>>> >
>>> >
>>> > This is still possible without reversing connection direction.  Note
>>> > that in zeromq there is *exactly zero* correlation between
>>> > communication direction and connection direction.  PUB can connect to
>>> > SUB, and vice versa.  In fact a single socket can bind and connect at
>>> > the same time.
>>> >
>>> >
>>> > It may also be unnecessary, because if the controller comes up at the
>>> > same endpoint(s), then zeromq handles all the reconnects invisibly.  A
>>> > connection to an endpoint is always valid, whether or not there is a
>>> > socket present at any given point in time.
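>>> >
>>> > A toy pyzmq illustration of that direction independence (not IPython
>>> > code, just a sketch): the SUB side binds, the PUB side connects, and the
>>> > PUB end can go away and come back without the SUB side ever noticing:
>>> >
>>> >     import time
>>> >     import zmq
>>> >
>>> >     ctx = zmq.Context()
>>> >
>>> >     # "server" side: a SUB socket that *binds* to a fixed endpoint
>>> >     sub = ctx.socket(zmq.SUB)
>>> >     sub.setsockopt(zmq.SUBSCRIBE, b"")
>>> >     sub.bind("tcp://127.0.0.1:5555")
>>> >
>>> >     # "client" side: a PUB socket that *connects* to that endpoint
>>> >     pub = ctx.socket(zmq.PUB)
>>> >     pub.connect("tcp://127.0.0.1:5555")
>>> >
>>> >     time.sleep(0.5)       # give the connection a moment to establish
>>> >     pub.send(b"hello")
>>> >     print(sub.recv())     # 'hello'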
>>> >
>>> >
>>> >         On Sun, 2012-02-12 at 12:06 -0800, MinRK wrote:
>>> >         >
>>> >         >
>>> >         > On Sun, Feb 12, 2012 at 11:48, Darren Govoni <darren@ontrenet.com> wrote:
>>> >         >         On Sun, 2012-02-12 at 11:12 -0800, MinRK wrote:
>>> >         >         >
>>> >         >         >
>>> >         >         > On Sun, Feb 12, 2012 at 10:42, Darren Govoni <darren@ontrenet.com> wrote:
>>> >         >         >         Thanks Min,
>>> >         >         >
>>> >         >         >         Is it possible to open a ticket for this
>>> >         >         >         capability for a (near) future release? It
>>> >         >         >         complements the already amazing load-balancing
>>> >         >         >         capability.
>>> >         >         >
>>> >         >         >
>>> >         >         > You are welcome to open an Issue.  I don't know if
>>> >         >         > it will make it into one of the next few releases,
>>> >         >         > but it is on my todo list.  The best way to get this
>>> >         >         > sort of thing going is to start with a Pull Request.
>>> >         >
>>> >         >
>>> >         >         Ok, I will open an issue. Thanks. In the meantime, is
>>> >         >         it possible for clients to 'know' when a controller is
>>> >         >         no longer available? For example, it would be nice if I
>>> >         >         could insert a callback handler for this sort of
>>> >         >         internal exception so I can provide some graceful
>>> >         >         recovery options.
>>> >         >
>>> >         >
>>> >         > It would be sensible to add a heartbeat mechanism on the
>>> >         > controller->client PUB channel for this information.  Until
>>> >         > then, your main controller crash detection is going to be
>>> >         > simple timeouts.
>>> >         >
>>> >         >
>>> >         > ZeroMQ makes disconnect detection a challenge (because there
>>> >         > are no disconnect events, because a disconnected channel is
>>> >         > still valid, as the peer is allowed to just come back up).
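>>> >         >
>>> >         > A minimal timeout-based check might look like this (a sketch,
>>> >         > not an existing API; `handle_controller_down` is a placeholder
>>> >         > for whatever recovery you want to do):
>>> >         >
>>> >         >     import os
>>> >         >     from IPython.parallel import Client
>>> >         >
>>> >         >     client = Client()
>>> >         >     ar = client[:].apply_async(os.getpid)  # trivial task
>>> >         >     ar.wait(timeout=10)                    # block at most ~10s
>>> >         >     if not ar.ready():
>>> >         >         # nothing came back: assume the controller is down
>>> >         >         handle_controller_down()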
>>> >         >
>>> >         >
>>> >         >         >
>>> >         >         >
>>> >         >         >         Perhaps a related but separate notion would
>>> >         >         >         be the ability to have clustered controllers
>>> >         >         >         for HA.
>>> >         >         >
>>> >         >         >
>>> >         >         > I do have a model in mind for this sort of thing,
>>> >         >         > though not multiple *controllers*, rather multiple
>>> >         >         > Schedulers.  Our design with 0MQ would make this
>>> >         >         > pretty simple (just start another scheduler, and an
>>> >         >         > extra call to socket.connect() on the Client and
>>> >         >         > Engine is all that's needed), and this should allow
>>> >         >         > scaling to tens of thousands of engines.
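>>> >         >         >
>>> >         >         > (A toy zeromq sketch of that idea, not IPython code;
>>> >         >         > the scheduler hostnames are made up.  A single socket
>>> >         >         > can connect to several endpoints, and outgoing
>>> >         >         > messages are load-balanced across them, so adding a
>>> >         >         > scheduler really is just one more connect() call.)
>>> >         >         >
>>> >         >         >     import zmq
>>> >         >         >
>>> >         >         >     ctx = zmq.Context()
>>> >         >         >     sock = ctx.socket(zmq.DEALER)
>>> >         >         >     # connect the same socket to two schedulers;
>>> >         >         >     # requests are round-robined between them
>>> >         >         >     sock.connect("tcp://scheduler-a:10101")
>>> >         >         >     sock.connect("tcp://scheduler-b:10101")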
>>> >         >
>>> >         >
>>> >         >         Yes! That's what I'm after. In this cloud-scale age of
>>> >         >         computing, that would be ideal.
>>> >         >
>>> >         >
>>> >         >         Thanks Min.
>>> >         >
>>> >         >         >
>>> >         >         >
>>> >         >         >         On Sun, 2012-02-12 at 08:32 -0800, Min RK wrote:
>>> >         >         >         > No, there is no failover mechanism.  When
>>> >         >         >         > the controller goes down, further requests
>>> >         >         >         > will simply hang.  We have almost all the
>>> >         >         >         > information we need to bring up a new
>>> >         >         >         > controller in its place (restart it), in
>>> >         >         >         > which case the Client wouldn't even need to
>>> >         >         >         > know that it went down, and would continue
>>> >         >         >         > to just work, thanks to some zeromq magic.
>>> >         >         >         >
>>> >         >         >         > -MinRK
>>> >         >         >         >
>>> >         >         >         > On Feb 12, 2012, at 5:02, Darren Govoni <darren@ontrenet.com> wrote:
>>> >         >         >         >
>>> >         >         >         > > Hi,
>>> >         >         >         > >  Does ipython support any kind of
>>> >         >         >         > > clustering or failover for ipcontrollers?
>>> >         >         >         > > I'm wondering how situations are handled
>>> >         >         >         > > where a controller goes down when a client
>>> >         >         >         > > needs to perform something.
>>> >         >         >         > >
>>> >         >         >         > > thanks for any tips.
>>> >         >         >         > > Darren
>>> >         >         >         > >
>>> >         >         >         > >
> _______________________________________________
> IPython-User mailing list
> IPython-User@scipy.org
> http://mail.scipy.org/mailman/listinfo/ipython-user
>


