Can I ask more about what your environment is like, and the typical circumstances of controller shutdown / crash?<div><br></div><div>How often does the controller die, how many tasks are pending in the Schedulers, and how many are active on engines when this happens? What are your expectations/hopes/dreams for behavior if the controller goes down while a bunch of work is in-flight?</div>
<div><br></div><div>-MinRK<br><br><div class="gmail_quote">On Tue, Mar 6, 2012 at 13:20, <span dir="ltr"><<a href="mailto:darren@ontrenet.com">darren@ontrenet.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Wow. Awesome. Let me try it. Many thanks.<br>
<div class="HOEnZb"><div class="h5"><br>
> You might check out this first-go implementation:<br>
><br>
> <a href="https://github.com/ipython/ipython/pull/1471" target="_blank">https://github.com/ipython/ipython/pull/1471</a><br>
><br>
> It seems to work fine if the cluster was idle at controller crash, but I<br>
> haven't tested the behavior of running jobs. I'm certain that the<br>
> propagation of results of jobs submitted before shutdown all the way up to<br>
> interactive Clients is broken, but the results should still arrive in the<br>
> Hub's db.<br>
><br>
> -MinRK<br>
><br>
><br>
> On Mon, Mar 5, 2012 at 16:38, MinRK <<a href="mailto:benjaminrk@gmail.com">benjaminrk@gmail.com</a>> wrote:<br>
><br>
>> Correct, engines do not reconnect to a new controller, and right now a<br>
>> Controller is a single point of failure.<br>
>><br>
>> We absolutely do intend to enable restarting the controller, and it<br>
>> wouldn't be remotely difficult; the code just isn't written yet.<br>
>><br>
>> Steps required for this:<br>
>><br>
>> 1. persist engine connection state to files/db (engine ID/UUID mapping<br>
>> should suffice)<br>
>> 2. when starting up, load this information into the Hub, instead of<br>
>> starting from scratch<br>
>><br>
>> That is all. No change should be required in the engines or clients, as<br>
>> zeromq handles the reconnect automagically.<br>
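The two steps above can be sketched as a save/load pair. This is only an illustration, not the code in the pull request: the function names, the JSON file format, and the shape of the mapping (integer engine ID to UUID string) are all assumptions.<br>

```python
import json
import os


def save_engine_state(path, engines):
    """Step 1: persist the engine ID -> UUID mapping (hypothetical format)."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        # JSON object keys must be strings, so engine IDs are stringified.
        json.dump({str(eid): uuid for eid, uuid in engines.items()}, f)
    # Rename into place so a crash mid-write never leaves a half-written file.
    os.replace(tmp, path)


def load_engine_state(path):
    """Step 2: on startup, reload the mapping instead of starting from scratch."""
    if not os.path.exists(path):
        return {}  # first start: nothing persisted yet
    with open(path) as f:
        return {int(eid): uuid for eid, uuid in json.load(f).items()}
```

A restarted Hub would call something like `load_engine_state` before accepting new registrations, so that engines reconnecting at the same endpoint are already known.<br>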
>><br>
>> There is already enough information stored in the *task* database to<br>
>> resume all tasks that were waiting in the Scheduler, but I'm not sure<br>
>> whether this should be done by default, or only on request.<br>
>><br>
>> -MinRK<br>
>><br>
>> On Mon, Mar 5, 2012 at 15:17, Darren Govoni <<a href="mailto:darren@ontrenet.com">darren@ontrenet.com</a>> wrote:<br>
>><br>
>>> Hi,<br>
>>><br>
>>> On Sun, 2012-02-12 at 13:19 -0800, MinRK wrote:<br>
>>> > It may also be unnecessary, because if the controller comes up at the<br>
>>> > same endpoint(s), then zeromq handles all the reconnects invisibly. A<br>
>>> > connection to an endpoint is always valid, whether or not there is a<br>
>>> > socket present at any given point in time.<br>
>>><br>
>>> I tried an example to see this. I ran an ipcontroller on one machine<br>
>>> with static --port=21001 so engine client files would always be valid.<br>
>>><br>
>><br>
>> Just specifying the registration port isn't enough information, and you<br>
>> should be using `--reuse` or `IPControllerApp.reuse_files=True` for<br>
>> connection files to remain valid across sessions.<br>
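As a sketch of the setup described above, the same setting can live in the controller's config file. The file name `ipcontroller_config.py` and its location in the active profile directory are assumptions; `reuse_files` is the option named in this thread.<br>

```python
# ipcontroller_config.py (assumed to live in the active profile directory)
c = get_config()  # provided by IPython when it loads a config file

# Keep connection files valid across controller restarts; this is the
# config-file equivalent of passing --reuse on the command line.
c.IPControllerApp.reuse_files = True
```

The registration port can still be pinned on the command line as before, for example `ipcontroller --reuse --port=21001`.<br>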
>><br>
>><br>
>>><br>
>>> I connected one engine from another server.<br>
>>><br>
>>> I killed the controller and restarted it.<br>
>>><br>
>>> After doing:<br>
>>><br>
>>> client = Client()<br>
>>> client.ids<br>
>>> []<br>
>>><br>
>>> There are no longer any engines connected.<br>
>>><br>
>>> dview = client[:]<br>
>>> ...<br>
>>> NoEnginesRegistered: Can't build targets without any engines<br>
>>><br>
>>> The problem perhaps is that for any large scale system, say 1 controller<br>
>>> with 50 engines running on 50 servers, this single-point-of-failure is<br>
>>> hard to remedy.<br>
>>><br>
>>> Is there a way to tell the controller to reconnect to last known engine<br>
>>> IP addresses? Or some other way to re-establish the grid? Rebooting 50<br>
>>> servers is not a good option for us.<br>
>>><br>
>>> On Sun, 2012-02-12 at 13:19 -0800, MinRK wrote:<br>
>>> ><br>
>>> ><br>
>>> > On Sun, Feb 12, 2012 at 13:02, Darren Govoni <<a href="mailto:darren@ontrenet.com">darren@ontrenet.com</a>> wrote:<br>
>>> > Correct me if I'm wrong, but do the ipengines 'connect' or otherwise<br>
>>> > announce their presence to the controller?<br>
>>> ><br>
>>> ><br>
>>> > Yes, 100% of the connections are inbound to the controller processes,<br>
>>> > from clients and engines alike. This is a strict requirement, because<br>
>>> > it would not be acceptable for engines to need open ports for inbound<br>
>>> > connections. Simply bringing up a new controller with the same<br>
>>> > connection information would result in the cluster continuing to<br>
>>> > function, with the engines and client never realizing the controller<br>
>>> > went down at all, nor having to act on it in any way.<br>
>>> ><br>
>>> > If it were the other way around, then this would accommodate some degree<br>
>>> > of fault tolerance for the controller, because it could be restarted by a<br>
>>> > watchdog and then re-establish the connected state of the cluster. For<br>
>>> > example, a controller comes online, a pub/sub message is sent to a known<br>
>>> > channel, and clients or engines add the new ipcontroller to their<br>
>>> > internal list as a failover endpoint.<br>
>>> ><br>
>>> ><br>
>>> > This is still possible without reversing connection direction. Note<br>
>>> > that in zeromq there is *exactly zero* correlation between<br>
>>> > communication direction and connection direction. PUB can connect to<br>
>>> > SUB, and vice versa. In fact a single socket can bind and connect at<br>
>>> > the same time.<br>
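A minimal pyzmq sketch of the point above: here the SUB socket binds and the PUB socket connects, yet data still flows from PUB to SUB. The function name, the loopback address, and the random port are assumptions made for illustration; this is not code from IPython.<br>

```python
import time

import zmq


def pub_connects_to_sub(timeout_s=5.0):
    """Send from a connecting PUB to a binding SUB; return what SUB received."""
    ctx = zmq.Context()

    # The receiver is the one that binds (listens for inbound connections).
    sub = ctx.socket(zmq.SUB)
    sub.setsockopt(zmq.LINGER, 0)
    sub.setsockopt(zmq.SUBSCRIBE, b"")  # subscribe to everything
    port = sub.bind_to_random_port("tcp://127.0.0.1")

    # The sender is the one that connects (makes the outbound connection).
    pub = ctx.socket(zmq.PUB)
    pub.setsockopt(zmq.LINGER, 0)
    pub.connect("tcp://127.0.0.1:%d" % port)

    # PUB drops messages until the connection is up, so keep sending
    # until one arrives or the deadline passes.
    received = None
    deadline = time.time() + timeout_s
    while received is None and time.time() < deadline:
        pub.send(b"hello")
        if sub.poll(100):  # wait up to 100 ms for a message
            received = sub.recv()

    pub.close()
    sub.close()
    ctx.term()
    return received
```

The message direction (PUB to SUB) is fixed by the socket types, while which side calls `bind` and which calls `connect` is a separate deployment choice.<br>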
>>> ><br>
>>> ><br>
>>> > It may also be unnecessary, because if the controller comes up at the<br>
>>> > same endpoint(s), then zeromq handles all the reconnects invisibly. A<br>
>>> > connection to an endpoint is always valid, whether or not there is a<br>
>>> > socket present at any given point in time.<br>
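The reconnect behaviour described above can be tried with a small pyzmq sketch. A PUSH socket stands in for an engine or client and a PULL socket stands in for the controller; these socket types, the function name, and the loopback endpoint are assumptions chosen for brevity, not what IPython itself uses.<br>

```python
import time

import zmq


def restart_demo(timeout_s=10.0):
    """Kill and re-create the binding side; the connecting side never reconnects by hand."""
    ctx = zmq.Context()

    # First "controller": bind to a free port and remember the endpoint.
    first = ctx.socket(zmq.PULL)
    first.setsockopt(zmq.LINGER, 0)
    port = first.bind_to_random_port("tcp://127.0.0.1")
    endpoint = "tcp://127.0.0.1:%d" % port

    # "Engine": connects exactly once, for the whole demo.
    worker = ctx.socket(zmq.PUSH)
    worker.setsockopt(zmq.LINGER, 0)
    worker.connect(endpoint)
    worker.send(b"before")
    before = first.recv() if first.poll(5000) else None

    # The "controller" goes down ...
    first.close()

    # ... and a new one comes up at the same endpoint. The old listener is
    # released asynchronously, so retry the bind for a short while.
    second = ctx.socket(zmq.PULL)
    second.setsockopt(zmq.LINGER, 0)
    deadline = time.time() + timeout_s
    bound = False
    while not bound and time.time() < deadline:
        try:
            second.bind(endpoint)
            bound = True
        except zmq.ZMQError:
            time.sleep(0.1)

    # Keep sending on the original socket until something gets through.
    after = None
    while bound and after is None and time.time() < deadline:
        try:
            worker.send(b"after", zmq.NOBLOCK)
        except zmq.Again:
            pass  # send queue is full; just wait for delivery
        if second.poll(200):
            after = second.recv()

    worker.close()
    second.close()
    ctx.term()
    return before, after
```

Messages that were in flight at the moment of the crash may still be lost, which is why the sketch resends until the new socket receives one.<br>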
>>> ><br>
>>> ><br>
>>> > On Sun, 2012-02-12 at 12:06 -0800, MinRK wrote:<br>
>>> > ><br>
>>> > ><br>
>>> > > On Sun, Feb 12, 2012 at 11:48, Darren Govoni <<a href="mailto:darren@ontrenet.com">darren@ontrenet.com</a>> wrote:<br>
>>> > > On Sun, 2012-02-12 at 11:12 -0800, MinRK wrote:<br>
>>> > > ><br>
>>> > > ><br>
>>> > > > On Sun, Feb 12, 2012 at 10:42, Darren Govoni <<a href="mailto:darren@ontrenet.com">darren@ontrenet.com</a>> wrote:<br>
>>> > > > Thanks Min,<br>
>>> > > ><br>
>>> > > > Is it possible to open a ticket for this capability for a (near) future<br>
>>> > > > release? It complements the already amazing load balancing capability.<br>
>>> > > ><br>
>>> > > ><br>
>>> > > > You are welcome to open an Issue. I don't know if it will make it<br>
>>> > > > into one of the next few releases, but it is on my todo list. The<br>
>>> > > > best way to get this sort of thing going is to start with a Pull<br>
>>> > > > Request.<br>
>>> > ><br>
>>> > ><br>
>>> > > Ok, I will open an issue. Thanks. In the meantime, is it possible for<br>
>>> > > clients to 'know' when a controller is no longer available? For example,<br>
>>> > > it would be nice if I can insert a callback handler for this sort of<br>
>>> > > internal exception so I can provide some graceful recovery options.<br>
>>> > ><br>
>>> > ><br>
>>> > > It would be sensible to add a heartbeat mechanism on the<br>
>>> > > controller->client PUB channel for this information. Until then, your<br>
>>> > > main controller crash detection is going to be simple timeouts.<br>
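The timeout approach above, combined with the callback asked about earlier in the thread, might look like the following sketch. `HeartbeatMonitor` and all of its methods are hypothetical names, not part of IPython; the idea is to call `beat()` whenever any reply or heartbeat arrives from the controller and `check()` periodically.<br>

```python
import time


class HeartbeatMonitor(object):
    """Declare the controller dead if nothing was heard for `timeout` seconds."""

    def __init__(self, timeout, on_timeout=None, clock=time.time):
        self.timeout = timeout
        self.on_timeout = on_timeout  # optional callback for recovery logic
        self.clock = clock            # injectable so it can be tested
        self.last_seen = clock()
        self._reported = False

    def beat(self):
        """Record that a message from the controller just arrived."""
        self.last_seen = self.clock()
        self._reported = False

    def check(self):
        """Return True if the controller still looks alive.

        The callback fires once per outage, not on every check.
        """
        alive = (self.clock() - self.last_seen) <= self.timeout
        if not alive and not self._reported:
            self._reported = True
            if self.on_timeout is not None:
                self.on_timeout()
        return alive
```

A timeout only says that nothing has been heard recently; a slow controller and a dead one look the same, so the threshold needs to be chosen with the longest expected pause in mind.<br>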
>>> > ><br>
>>> > ><br>
>>> > > ZeroMQ makes disconnect detection a challenge (because there are no<br>
>>> > > disconnect events, because a disconnected channel is still valid, as<br>
>>> > > the peer is allowed to just come back up).<br>
>>> > ><br>
>>> > ><br>
>>> > > ><br>
>>> > > ><br>
>>> > > > Perhaps a related but separate notion would be the ability to have<br>
>>> > > > clustered controllers for HA.<br>
>>> > > ><br>
>>> > > ><br>
>>> > > > I do have a model in mind for this sort of thing, though not multiple<br>
>>> > > > *controllers*, rather multiple Schedulers. Our design with 0MQ would<br>
>>> > > > make this pretty simple (just start another scheduler, and make an<br>
>>> > > > extra call to socket.connect() on the Client and Engine is all that's<br>
>>> > > > needed), and this should allow scaling to tens of thousands of<br>
>>> > > > engines.<br>
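The "extra call to socket.connect()" idea can be illustrated with pyzmq: one sending socket connected to several receivers spreads its messages across them. PUSH/PULL are used here as simple stand-ins for the Client and Scheduler sockets, and the function name and loopback endpoints are assumptions; this is not the proposed IPython design itself.<br>

```python
import time

import zmq


def fan_out(n_tasks=20, n_schedulers=2, timeout_s=5.0):
    """Send tasks from one socket to several 'schedulers'; return per-scheduler counts."""
    ctx = zmq.Context()
    client = ctx.socket(zmq.PUSH)
    client.setsockopt(zmq.LINGER, 0)

    schedulers = []
    for _ in range(n_schedulers):
        sched = ctx.socket(zmq.PULL)
        sched.setsockopt(zmq.LINGER, 0)
        port = sched.bind_to_random_port("tcp://127.0.0.1")
        # One more connect() on the same socket per extra scheduler.
        client.connect("tcp://127.0.0.1:%d" % port)
        schedulers.append(sched)

    for i in range(n_tasks):
        client.send(("task %d" % i).encode("ascii"))

    # Drain every scheduler until all tasks are accounted for.
    counts = [0] * n_schedulers
    deadline = time.time() + timeout_s
    while sum(counts) < n_tasks and time.time() < deadline:
        for i, sched in enumerate(schedulers):
            while sched.poll(50):
                sched.recv()
                counts[i] += 1

    client.close()
    for sched in schedulers:
        sched.close()
    ctx.term()
    return counts
```

Calling `fan_out()` returns how many tasks each stand-in scheduler received; the exact split depends on how zeromq distributes messages across the connections.<br>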
>>> > ><br>
>>> > ><br>
>>> > > Yes! That's what I'm after. In this cloud-scale age of computing, that<br>
>>> > > would be ideal.<br>
>>> > ><br>
>>> > ><br>
>>> > > Thanks Min.<br>
>>> > ><br>
>>> > > ><br>
>>> > > ><br>
>>> > > > On Sun, 2012-02-12 at 08:32 -0800, Min RK wrote:<br>
>>> > > > > No, there is no failover mechanism. When the controller goes down,<br>
>>> > > > > further requests will simply hang. We have almost all the information<br>
>>> > > > > we need to bring up a new controller in its place (restart it), in<br>
>>> > > > > which case the Client wouldn't even need to know that it went down,<br>
>>> > > > > and would continue to just work, thanks to some zeromq magic.<br>
>>> > > > ><br>
>>> > > > > -MinRK<br>
>>> > > > ><br>
>>> > > > > On Feb 12, 2012, at 5:02, Darren Govoni <<a href="mailto:darren@ontrenet.com">darren@ontrenet.com</a>> wrote:<br>
>>> > > > ><br>
>>> > > > > > Hi,<br>
>>> > > > > > Does ipython support any kind of clustering or failover for<br>
>>> > > > > > ipcontrollers? I'm wondering how situations are handled where a<br>
>>> > > > > > controller goes down when a client needs to perform something.<br>
>>> > > > > ><br>
>>> > > > > > thanks for any tips.<br>
>>> > > > > > Darren<br>
>>> > > > > ><br>
>>> > > > > ><br>
>>> > _______________________________________________<br>
>>> > > > > > IPython-User mailing list<br>
>>> > > > > > <a href="mailto:IPython-User@scipy.org">IPython-User@scipy.org</a><br>
>>> > > > > ><br>
>>> > > <a href="http://mail.scipy.org/mailman/listinfo/ipython-user" target="_blank">http://mail.scipy.org/mailman/listinfo/ipython-user</a><br>
</div></div></blockquote></div><br></div>