[IPython-User] engines dying due to excessive load

MinRK benjaminrk@gmail....
Thu Jun 28 18:19:37 CDT 2012


On Thu, Jun 28, 2012 at 3:28 PM, Robert Nishihara <robertnishihara@gmail.com> wrote:

> I've been trying to figure this out for a couple of days now, and I'm
> curious whether anyone has seen a similar problem.
>
> My setup is
>
>     ipcontroller --profile=sge
>     ipcluster engines -n 100 --profile=sge
>
> My script uses map_sync with a direct view. After my script has run for a
> couple of minutes, the load on the compute nodes grows excessively high and
> the scheduler starts suspending jobs, so some of the engines get suspended.
> This causes my script to terminate with an error like the one below:
>
>     [Engine Exception]EngineError: Engine 1315 died while running task '966abf73-3183-4db3-8cf2-96bd08c2312b'
>
> The engine is numbered 1315 because I sometimes restart the engines
> without restarting the controller.
>
> Why would suspending an engine cause my script to terminate instead of
> simply forcing it to wait?
>

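Heartbeats.  Each engine has a heartbeat that the Hub monitors to detect
engine death.  A suspended process stops the heartbeat, so the Hub treats the
suspended engine as dead, and the task it was running fails with the
EngineError you quoted.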
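
To illustrate the idea, here is a minimal sketch of heartbeat-based death detection. It is a toy model, not IPython's actual implementation: the class HeartbeatMonitor, its period and max_misses parameters, and the explicit "now" timestamps are all hypothetical names chosen for the example. It only shows why a suspended (but not crashed) process looks the same as a dead one to a monitor that sees nothing but missing beats.

```python
class HeartbeatMonitor:
    """Toy failure detector (hypothetical, not IPython's Hub code).

    An engine is declared dead once no beat has been seen for longer
    than period * max_misses seconds.
    """

    def __init__(self, period=1.0, max_misses=3):
        self.period = period
        self.max_misses = max_misses
        self.last_seen = {}  # engine id -> timestamp of the last beat

    def beat(self, engine_id, now):
        # Record that the engine answered a heartbeat at time `now`.
        self.last_seen[engine_id] = now

    def dead_engines(self, now):
        # The monitor cannot tell "suspended" from "crashed": both
        # simply stop producing beats.
        timeout = self.period * self.max_misses
        return sorted(e for e, t in self.last_seen.items() if now - t > timeout)


monitor = HeartbeatMonitor(period=1.0, max_misses=3)
monitor.beat(0, now=0.0)
monitor.beat(1, now=0.0)

# Engine 1 is suspended by the scheduler after t=0; engine 0 keeps beating.
for t in (1.0, 2.0, 3.0, 4.0):
    monitor.beat(0, now=t)

print(monitor.dead_engines(now=4.0))  # engine 1 has been silent for 4 s > 3 s
```

In this model a longer period or a larger max_misses would tolerate longer suspensions, at the cost of noticing genuinely dead engines later.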


>
> Why might the load be so high? Each node has 32 cores. At most twenty
> engines are running on each node. Yet, sometimes several hundred processes
> are vying for space on a given node (and I'm the only one using the
> cluster). Could it be the queuing of messages or something?
>

I have no idea about this one: each engine is a single process, and the
Controller is a collection of five processes (four schedulers and the Hub).
What is your workload?

