[IPython-User] engines dying due to excessive load
Thu Jun 28 19:19:34 CDT 2012
On Thu, Jun 28, 2012 at 5:04 PM, Bago <firstname.lastname@example.org> wrote:
> On Thu, Jun 28, 2012 at 3:28 PM, Robert Nishihara <
> email@example.com> wrote:
>> I've been trying to figure this out for a couple days now, and I'm
>> curious if anyone has seen a similar problem.
>> My setup is
>> ipcontroller --profile=sge
>> ipcluster engines -n 100 --profile=sge
>> My script uses map_sync with a direct view. After running my script for a
>> couple minutes, the load on the compute nodes grows excessively high and
>> the scheduler starts suspending jobs, so some of the engines get suspended.
>> This causes my script to terminate with an error like the one below
>> [Engine Exception]EngineError: Engine 1315 died while running task
>> The engine is numbered 1315 because I sometimes restart the engines
>> without restarting the controller.
>> Why would suspending an engine cause my script to terminate instead
>> of simply forcing it to wait?
>> Why might the load be so high? Each node has 32 cores. At most twenty
>> engines are running on each node. Yet, sometimes several hundred processes
>> are vying for space on a given node (and I'm the only one using the
>> cluster). Could it be the queuing of messages or something?
> This is a bit of a shot in the dark, but on our machines we need to set
> MKL_NUM_THREADS=1, otherwise some numpy functions (which I assume are
> calling MKL functions) try to use 16 threads. Is it possible some of your
> code, or some library you rely on, is multi-threaded?
The only library *IPython* uses that is multithreaded is zeromq, but that's
only one additional thread. If *you* are using numpy, then the MKL
environment is relevant.
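
If it does turn out to be MKL, one way to try Bago's suggestion from Python
is sketched below. This is only a sketch: limit_blas_threads is a
hypothetical helper name, and including OMP_NUM_THREADS alongside
MKL_NUM_THREADS is my assumption, not something from this thread. I am also
assuming the variables have to be in place in each engine's process before
numpy is first imported there.

```python
import os

# Environment variables commonly used to cap numerical-library threads.
# MKL_NUM_THREADS is the one mentioned in this thread; OMP_NUM_THREADS is
# an extra assumption for OpenMP-based builds.
THREAD_VARS = ("MKL_NUM_THREADS", "OMP_NUM_THREADS")


def limit_blas_threads(n=1, env=os.environ):
    """Set each thread-cap variable to n in env and return what was set."""
    for var in THREAD_VARS:
        env[var] = str(n)
    return {var: env[var] for var in THREAD_VARS}


# Call this before numpy is imported in the process.
limit_blas_threads(1)
```

It may be simpler to export MKL_NUM_THREADS=1 in whatever job script starts
the engines, so that every engine inherits it from the start.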