[IPython-User] ipcontroller failover?
darren@ontrene...
Tue Mar 6 15:20:48 CST 2012
Wow. Awesome. Let me try it. Many thanks.
> You might check out this first-go implementation:
>
> https://github.com/ipython/ipython/pull/1471
>
> It seems to work fine if the cluster was idle at controller crash, but I
> haven't tested the behavior of running jobs. I'm certain that the
> propagation of results of jobs submitted before shutdown all the way up to
> interactive Clients is broken, but the results should still arrive in the
> Hub's db.
>
> -MinRK
>
>
> On Mon, Mar 5, 2012 at 16:38, MinRK <benjaminrk@gmail.com> wrote:
>
>> Correct, engines do not reconnect to a new controller, and right now a
>> Controller is a single point of failure.
>>
>> We absolutely do intend to enable restarting the controller, and it
>> wouldn't be remotely difficult, the code just isn't written yet.
>>
>> Steps required for this:
>>
>> 1. persist engine connection state to files/db (engine ID/UUID mapping
>> should suffice)
>> 2. when starting up, load this information into the Hub, instead of
>> starting from scratch
>>
>> That is all. No change should be required in the engines or clients, as
>> zeromq handles the reconnect automagically.
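
The two steps above can be sketched in a few lines. This is only an illustration of the idea, not the code in the pull request: the file name, the `save_engines`/`load_engines` helpers, and the flat ID-to-UUID dict are all assumptions made for the example.

```python
import json
import os
import tempfile

# Hypothetical helpers: neither name is part of IPython's API.


def save_engines(path, engines):
    """Persist the engine ID -> UUID mapping so a restarted Hub can reload it."""
    # Write to a temporary file first, then rename, so a crash mid-write
    # never leaves a truncated state file behind.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        # JSON object keys must be strings, so integer IDs are stringified.
        json.dump({str(eid): uuid for eid, uuid in engines.items()}, f)
    os.replace(tmp, path)


def load_engines(path):
    """Load the saved mapping, or start from scratch if none exists."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return {int(eid): uuid for eid, uuid in json.load(f).items()}
```

A restarted controller would call something like `load_engines("engines.json")` at startup and `save_engines(...)` whenever an engine registers or unregisters.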
>>
>> There is already enough information stored in the *task* database to
>> resume all tasks that were waiting in the Scheduler, but I'm not sure
>> whether this should be done by default, or only on request.
>>
>> -MinRK
>>
>> On Mon, Mar 5, 2012 at 15:17, Darren Govoni <darren@ontrenet.com> wrote:
>>
>>> Hi,
>>>
>>> On Sun, 2012-02-12 at 13:19 -0800, MinRK wrote:
>>> > It may also be unnecessary, because if the controller comes up at the
>>> > same endpoint(s), then zeromq handles all the reconnects invisibly. A
>>> > connection to an endpoint is always valid, whether or not there is a
>>> > socket present at any given point in time.
>>>
>>> I tried an example to see this. I ran an ipcontroller on one machine
>>> with static --port=21001 so engine client files would always be valid.
>>>
>>
>> Just specifying the registration port isn't enough information, and you
>> should be using `--reuse` or `IPControllerApp.reuse_files=True` for
>> connection files to remain valid across sessions.
>>
>>
>>>
>>> I connected one engine from another server.
>>>
>>> I killed the controller and restarted it.
>>>
>>> After doing:
>>>
>>> client = Client()
>>> client.ids
>>> []
>>>
>>> There are no longer any engines connected.
>>>
>>> dview = client[:]
>>> ...
>>> NoEnginesRegistered: Can't build targets without any engines
>>>
>>> The problem perhaps is that for any large scale system, say 1 controller
>>> with 50 engines running on 50 servers, this single-point-of-failure is
>>> hard to remedy.
>>>
>>> Is there a way to tell the controller to reconnect to last known engine
>>> IP addresses? Or some other way to re-establish the grid? Rebooting 50
>>> servers is not a good option for us.
>>>
>>> On Sun, 2012-02-12 at 13:19 -0800, MinRK wrote:
>>> >
>>> >
>>> > On Sun, Feb 12, 2012 at 13:02, Darren Govoni <darren@ontrenet.com> wrote:
>>> > Correct me if I'm wrong, but do the ipengines 'connect' or otherwise
>>> > announce their presence to the controller?
>>> >
>>> >
>>> > Yes, 100% of the connections are inbound to the controller processes,
>>> > from clients and engines alike. This is a strict requirement, because
>>> > it would not be acceptable for engines to need open ports for inbound
>>> > connections. Simply bringing up a new controller with the same
>>> > connection information would result in the cluster continuing to
>>> > function, with the engines and client never realizing the controller
>>> > went down at all, nor having to act on it in any way.
>>> >
>>> > If it were the other way around, then this would accommodate some
>>> > degree of fault tolerance for the controller, because it could be
>>> > restarted by a watchdog and then re-establish the connected state of
>>> > the cluster, i.e. a controller comes online, a pub/sub message is sent
>>> > to a known channel, and clients or engines add the new ipcontroller to
>>> > their internal list as a failover endpoint.
>>> >
>>> >
>>> > This is still possible without reversing connection direction. Note
>>> > that in zeromq there is *exactly zero* correlation between
>>> > communication direction and connection direction. PUB can connect to
>>> > SUB, and vice versa. In fact a single socket can bind and connect at
>>> > the same time.
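
The point about direction can be seen in a small pyzmq sketch. Here the SUB socket binds and the PUB socket connects, yet messages still travel from PUB to SUB. The loopback address and the retry loop are choices made for this example only, and it assumes pyzmq is installed.

```python
import zmq

ctx = zmq.Context()

# The subscriber binds (it owns the endpoint) ...
sub = ctx.socket(zmq.SUB)
sub.setsockopt(zmq.SUBSCRIBE, b"")
port = sub.bind_to_random_port("tcp://127.0.0.1")

# ... and the publisher connects, yet data still flows PUB -> SUB.
pub = ctx.socket(zmq.PUB)
pub.connect("tcp://127.0.0.1:%i" % port)

received = None
for _ in range(50):
    # PUB drops messages until the subscription has propagated,
    # so keep sending until one gets through.
    pub.send(b"hello")
    if sub.poll(100):  # timeout in milliseconds
        received = sub.recv()
        break

pub.close(linger=0)
sub.close(linger=0)
ctx.term()
```

Swapping which side calls `bind` and which calls `connect` should leave the message flow unchanged; only the party that owns the endpoint differs.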
>>> >
>>> >
>>> > It may also be unnecessary, because if the controller comes up at the
>>> > same endpoint(s), then zeromq handles all the reconnects invisibly. A
>>> > connection to an endpoint is always valid, whether or not there is a
>>> > socket present at any given point in time.
>>> >
>>> >
>>> > On Sun, 2012-02-12 at 12:06 -0800, MinRK wrote:
>>> > >
>>> > >
>>> > > On Sun, Feb 12, 2012 at 11:48, Darren Govoni <darren@ontrenet.com> wrote:
>>> > > On Sun, 2012-02-12 at 11:12 -0800, MinRK wrote:
>>> > > >
>>> > > >
>>> > > > On Sun, Feb 12, 2012 at 10:42, Darren Govoni <darren@ontrenet.com> wrote:
>>> > > > Thanks Min,
>>> > > >
>>> > > > Is it possible to open a ticket for this capability for a (near)
>>> > > > future release? It complements that already amazing load
>>> > > > balancing capability.
>>> > > >
>>> > > >
>>> > > > You are welcome to open an Issue. I don't know if it will make it
>>> > > > into one of the next few releases, but it is on my todo list. The
>>> > > > best way to get this sort of thing going is to start with a Pull
>>> > > > Request.
>>> > >
>>> > >
>>> > > Ok, I will open an issue. Thanks. In the meantime, is it possible
>>> > > for clients to 'know' when a controller is no longer available?
>>> > > For example, it would be nice if I can insert a callback handler
>>> > > for this sort of internal exception so I can provide some graceful
>>> > > recovery options.
>>> > >
>>> > >
>>> > > It would be sensible to add a heartbeat mechanism on the
>>> > > controller->client PUB channel for this information. Until then,
>>> > > your main controller crash detection is going to be simple timeouts.
>>> > >
>>> > >
>>> > > ZeroMQ makes disconnect detection a challenge (because there are no
>>> > > disconnect events, because a disconnected channel is still valid, as
>>> > > the peer is allowed to just come back up).
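
Timeout-based detection with a callback, as discussed above, might look roughly like the following. `ControllerWatch` and its methods are hypothetical names for this sketch, not anything IPython provides; the caller is assumed to invoke `heard_from_controller()` whenever a reply or heartbeat arrives and `check()` on a timer.

```python
import time


class ControllerWatch:
    """Hypothetical helper: flags the controller as down after a silent period."""

    def __init__(self, timeout, on_down, clock=time.monotonic):
        self.timeout = timeout      # seconds of silence tolerated
        self.on_down = on_down      # callback invoked once per outage
        self.clock = clock          # injectable for testing
        self.last_seen = clock()
        self.down = False

    def heard_from_controller(self):
        """Call whenever any message (e.g. a heartbeat) arrives."""
        self.last_seen = self.clock()
        self.down = False

    def check(self):
        """Call periodically; fires the callback when the timeout expires."""
        silent_for = self.clock() - self.last_seen
        if silent_for > self.timeout and not self.down:
            self.down = True
            self.on_down(silent_for)
        return not self.down
```

Because a silent peer may simply be slow, a timeout like this can only suggest that the controller is gone; the callback is the place to attempt a graceful recovery or to warn the user.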
>>> > >
>>> > >
>>> > > >
>>> > > >
>>> > > > Perhaps a related but separate notion would be the ability to
>>> > > > have clustered controllers for HA.
>>> > > >
>>> > > >
>>> > > > I do have a model in mind for this sort of thing, though not
>>> > > > multiple *controllers*, rather multiple Schedulers. Our design
>>> > > > with 0MQ would make this pretty simple (just start another
>>> > > > scheduler, and make an extra call to socket.connect() on the
>>> > > > Client and Engine is all that's needed), and this should allow
>>> > > > scaling to tens of thousands of engines.
>>> > >
>>> > >
>>> > > Yes! That's what I'm after. In this cloud-scale age of computing,
>>> > > that would be ideal.
>>> > >
>>> > >
>>> > > Thanks Min.
>>> > >
>>> > > >
>>> > > >
>>> > > > On Sun, 2012-02-12 at 08:32 -0800, Min RK wrote:
>>> > > > > No, there is no failover mechanism. When the controller goes
>>> > > > > down, further requests will simply hang. We have almost all
>>> > > > > the information we need to bring up a new controller in its
>>> > > > > place (restart it), in which case the Client wouldn't even
>>> > > > > need to know that it went down, and would continue to just
>>> > > > > work, thanks to some zeromq magic.
>>> > > > >
>>> > > > > -MinRK
>>> > > > >
>>> > > > > On Feb 12, 2012, at 5:02, Darren Govoni <darren@ontrenet.com> wrote:
>>> > > > >
>>> > > > > > Hi,
>>> > > > > > Does ipython support any kind of clustering or failover for
>>> > > > > > ipcontrollers? I'm wondering how situations are handled where
>>> > > > > > a controller goes down when a client needs to perform
>>> > > > > > something.
>>> > > > > >
>>> > > > > > thanks for any tips.
>>> > > > > > Darren
>>> > > > > >
>>> > > > > >
>>> > > > > > _______________________________________________
>>> > > > > > IPython-User mailing list
>>> > > > > > IPython-User@scipy.org
>>> > > > > > http://mail.scipy.org/mailman/listinfo/ipython-user