[IPython-User] engines dying due to excessive load

Robert Nishihara robertnishihara@gmail....
Thu Jun 28 23:19:26 CDT 2012


I am using numpy all over the place, so I will investigate if that is the
issue.
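
A minimal sketch of one way to test the threading theory, assuming the
extra processes come from a multithreaded BLAS backend such as MKL. The
helper name limit_blas_threads is hypothetical, and including
OMP_NUM_THREADS alongside the MKL_NUM_THREADS variable suggested below is
an assumption:

```python
import os


def limit_blas_threads(n=1):
    """Cap the thread count requested by MKL/OpenMP via environment variables."""
    # Imported again inside the function so it stays self-contained
    # if it is shipped to a remote engine.
    import os

    # MKL_NUM_THREADS comes from the suggestion in this thread;
    # OMP_NUM_THREADS is an added assumption for OpenMP-based libraries.
    names = ("MKL_NUM_THREADS", "OMP_NUM_THREADS")
    for name in names:
        os.environ[name] = str(n)
    return dict((name, os.environ[name]) for name in names)


# Call this before numpy is imported in the process.
settings = limit_blas_threads(1)
```

With a direct view this could presumably be run on every engine with
dview.apply_sync(limit_blas_threads, 1). Whether that still takes effect
once numpy is already loaded on an engine is not verified here, so
exporting the variable in the environment that launches the engines is
likely the safer route.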

On Thu, Jun 28, 2012 at 8:19 PM, MinRK <benjaminrk@gmail.com> wrote:

>
>
> On Thu, Jun 28, 2012 at 5:04 PM, Bago <mrbago@gmail.com> wrote:
>
>>
>>
>> On Thu, Jun 28, 2012 at 3:28 PM, Robert Nishihara <
>> robertnishihara@gmail.com> wrote:
>>
>>> I've been trying to figure this out for a couple days now, and I'm
>>> curious if anyone has seen a similar problem.
>>>
>>> My setup is
>>>
>>>     ipcontroller --profile=sge
>>>     ipcluster engines -n 100 --profile=sge
>>>
>>> My script uses map_sync with a direct view. After running my script for
>>> a couple minutes, the load on the compute nodes grows excessively high and
>>> the scheduler starts suspending jobs, so some of the engines get suspended.
>>> This causes my script to terminate with an error like the one below
>>>
>>>     [Engine Exception]EngineError: Engine 1315 died while running task
>>> '966abf73-3183-4db3-8cf2-96bd08c2312b'
>>>
>>> The engine is numbered 1315 because I sometimes restart the engines
>>> without restarting the controller.
>>>
>>> Why would suspending an engine cause my script to terminate
>>> instead of simply forcing it to wait?
>>>
>>> Why might the load be so high? Each node has 32 cores. At most twenty
>>> engines are running on each node. Yet, sometimes several hundred processes
>>> are vying for space on a given node (and I'm the only one using the
>>> cluster). Could it be the queuing of messages or something?
>>>
>>
>> This is a bit of a shot in the dark, but on our machines we need to set
>> MKL_NUM_THREADS=1, otherwise some numpy functions (which I assume are
>> calling MKL functions) try to use 16 threads. Is it possible some of your
>> code, or some library you rely on, is multi-threaded?
>>
>
> The only library *IPython* uses that is multithreaded is zeromq, but
> that's only one additional thread.  If *you* are using numpy, then the MKL
> environment is relevant.
>
>
>>
>>
>>> _______________________________________________
>>> IPython-User mailing list
>>> IPython-User@scipy.org
>>> http://mail.scipy.org/mailman/listinfo/ipython-user
>>>
>>>