[IPython-User] engines dying due to excessive load

Robert Nishihara robertnishihara@gmail....
Fri Jun 29 11:22:58 CDT 2012


Thanks for the help!

On Fri, Jun 29, 2012 at 12:22 PM, Robert Nishihara <
robertnishihara@gmail.com> wrote:

> Ok, so numpy uses Intel's Math Kernel Library (MKL), which tries to
> automatically parallelize things; on a cluster, that can cause problems
> for the scheduler.
>
> Setting MKL_NUM_THREADS=1 on the engines appears to have completely fixed
> the problem. In my script, I did this with the line
>
>     dview.execute("os.environ['MKL_NUM_THREADS']='1'")
>
> which stops the scheduler from suspending my jobs and also gives me a
> performance increase (presumably because the scheduler was unable to
> effectively handle the load).
>
> -Robert
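
A minimal sketch of the fix described above, using the IPython.parallel client
API of that era (the 'sge' profile comes from the quoted setup below; the rest
is illustrative, not the exact script):

    from IPython.parallel import Client

    # Connect to the running controller and take a direct view on all engines.
    rc = Client(profile='sge')
    dview = rc[:]

    # Import os on the engines, then cap MKL at one thread per engine so that
    # ~20 engines on a 32-core node don't spawn hundreds of BLAS threads.
    dview.execute("import os; os.environ['MKL_NUM_THREADS'] = '1'", block=True)

Exporting MKL_NUM_THREADS=1 in the engines' shell environment before they start
(for example in the SGE submission script) should have the same effect, since
MKL reads the variable when its threading layer initializes.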
>
>
> On Fri, Jun 29, 2012 at 12:19 AM, Robert Nishihara <
> robertnishihara@gmail.com> wrote:
>
>> I am using numpy all over the place, so I will investigate if that is the
>> issue.
>>
>>
>> On Thu, Jun 28, 2012 at 8:19 PM, MinRK <benjaminrk@gmail.com> wrote:
>>
>>>
>>>
>>> On Thu, Jun 28, 2012 at 5:04 PM, Bago <mrbago@gmail.com> wrote:
>>>
>>>>
>>>>
>>>> On Thu, Jun 28, 2012 at 3:28 PM, Robert Nishihara <
>>>> robertnishihara@gmail.com> wrote:
>>>>
>>>>> I've been trying to figure this out for a couple days now, and I'm
>>>>> curious if anyone has seen a similar problem.
>>>>>
>>>>> My setup is
>>>>>
>>>>>     ipcontroller --profile=sge
>>>>>     ipcluster engines -n 100 --profile=sge
>>>>>
>>>>> My script uses map_sync with a direct view. After running my script
>>>>> for a couple minutes, the load on the compute nodes grows excessively high
>>>>> and the scheduler starts suspending jobs, so some of the engines get
>>>>> suspended. This causes my script to terminate with an error like the one
>>>>> below
>>>>>
>>>>>     [Engine Exception]EngineError: Engine 1315 died while running task
>>>>> '966abf73-3183-4db3-8cf2-96bd08c2312b'
>>>>>
>>>>> The engine is numbered 1315 because I sometimes restart the engines
>>>>> without restarting the controller.
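
For context, a minimal client-side script matching this description could look
like the following; the 'sge' profile matches the commands above, while the
worker function and data are placeholders:

    from IPython.parallel import Client

    # Connect to the controller started with `ipcontroller --profile=sge`.
    rc = Client(profile='sge')
    dview = rc[:]              # direct view over all registered engines

    def work(x):
        # stand-in for the real per-item computation (heavy numpy/MKL calls)
        return x * x

    # map_sync blocks until every engine has returned its piece; if the
    # controller decides an engine died mid-task, the call raises an
    # EngineError like the one above instead of continuing to wait.
    results = dview.map_sync(work, range(1000))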
>>>>>
>>>>> Why would suspending an engine cause my script to terminate instead of
>>>>> simply forcing it to wait?
>>>>>
>>>>> Why might the load be so high? Each node has 32 cores. At most twenty
>>>>> engines are running on each node. Yet, sometimes several hundred processes
>>>>> are vying for space on a given node (and I'm the only one using the
>>>>> cluster). Could it be the queuing of messages or something?
>>>>>
>>>>
>>>> This is a bit of a shot in the dark, but on our machines we need to set
>>>> MKL_NUM_THREADS=1, otherwise some numpy functions (which I assume are
>>>> calling MKL functions) try to use 16 threads. Is it possible some of your
>>>> code, or some library you rely on, is multi-threaded?
>>>>
>>>
>>> The only library *IPython* uses that is multithreaded is zeromq, but
>>> that's only one additional thread.  If *you* are using numpy, then the MKL
>>> environment variable is relevant.
>>>
>>>