[IPython-User] engines dying due to excessive load

Robert Nishihara robertnishihara@gmail....
Thu Jun 28 17:28:24 CDT 2012

I've been trying to figure this out for a couple of days now, and I'm curious
whether anyone has seen a similar problem.

My setup is

    ipcontroller --profile=sge
    ipcluster engines -n 100 --profile=sge

My script uses map_sync with a direct view. After the script has run for a
couple of minutes, the load on the compute nodes grows excessively high and
the scheduler starts suspending jobs, so some of the engines get suspended.
This causes my script to terminate with an error like the one below:

    [Engine Exception]EngineError: Engine 1315 died while running task
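
For context, a minimal sketch of the kind of script described above. It is
not the actual script: the helper names (square, connect, run_on_all_engines)
are made up for illustration, and it assumes the IPython.parallel Client API
of this era, with the import deferred so the rest loads without IPython.

```python
def square(x):
    """Placeholder workload; the real task function is not shown in this post."""
    return x * x


def connect(profile="sge"):
    """Return a direct view over all engines registered with the controller.

    Assumes a controller started with `ipcontroller --profile=sge` is running.
    """
    # Deferred import: only needed when actually talking to a cluster.
    from IPython.parallel import Client

    rc = Client(profile=profile)
    return rc[:]  # slicing the client yields a DirectView on all engines


def run_on_all_engines(view, func, items):
    """Blocking map: split items across the view's engines and wait for all.

    With a direct view the work is assigned to specific engines up front,
    so a failure on any one engine surfaces here as an exception.
    """
    return view.map_sync(func, items)
```

Usage would be along the lines of
run_on_all_engines(connect(), square, range(1000)).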

The engine is numbered 1315 because I sometimes restart the engines without
restarting the controller.

Why would suspending an engine cause my script to terminate instead
of simply forcing it to wait?

Why might the load be so high? Each node has 32 cores. At most twenty
engines are running on each node. Yet, sometimes several hundred processes
are vying for space on a given node (and I'm the only one using the
cluster). Could it be the queuing of messages or something?
