[IPython-User] Using IPython as a Batch Queue

MinRK benjaminrk@gmail....
Mon Jan 23 18:53:11 CST 2012


On Sun, Jan 22, 2012 at 22:53, Erik Petigura <eptune@gmail.com> wrote:
> Dear Wes and Min,
>
> Thanks for the suggestions regarding other programs for managing batch
> submission.  If it's okay, I'd like to understand a bit more what's going on
> in the IPython framework.
>
> Periodically, one of my cores drops out.
>
> Can you explain this one? Is there any indication as to why one of
> your engines fails?  It's possible this is an erroneous heart failure,
> which can be alleviated by relaxing the heartbeat period to 5-10
> seconds with:
>
> c.HeartMonitor.period = 10
>
> in your ipcontroller_config.py
>
>
> Here is a `ps aux' dump of what's going on.  I cleaned up the paths for
> readability.
>
>
>
> USER       PID  %CPU %MEM      VSZ    RSS   TT  STAT STARTED      TIME COMMAND
> petigura 56013 100.0  0.3  2634324  53768 s003  R+   10:26PM   0:35.62 python val2134.py
> petigura 55962  99.0  0.3  2652864  55496 s003  R+   10:26PM   1:13.14 python val2140.py
> petigura 56025  99.0  0.3  2635648  53692 s003  R+   10:26PM   0:28.09 python val2139.py
> petigura 55812  98.5  0.3  2653816  62736 s003  R+   10:24PM   2:36.85 python val2135.py
> petigura 38665  22.6  0.5  2699096  99376 s002  R+   12:17PM  82:11.48 python ipython --pylab
> petigura 44579   0.3  0.2  2559724  33472 s003  S+    3:30PM   2:15.77 python ipcluster start --n=8
> petigura 44584   0.1  0.3  2643632  61900 s003  S+    3:30PM   1:07.71 python ipcontrollerapp.py --profile-dir /Users/petigura/.ipython/profile_default --log-to-file --log-level=20
> petigura 53491   0.0  0.0  2666688    432 s003  S+    9:17PM   0:00.00 python ipengineapp.py --profile-dir /Users/petigura/.ipython/profile_default --log-to-file --log-level=20
> petigura 44596   0.0  0.3  2666688  55640 s003  S+    3:30PM   0:06.63 python ipengineapp.py --profile-dir /Users/petigura/.ipython/profile_default --log-to-file --log-level=20
> petigura 44595   0.0  0.3  2665664  55680 s003  S+    3:30PM   0:06.88 python ipengineapp.py --profile-dir /Users/petigura/.ipython/profile_default --log-to-file --log-level=20
> petigura 44594   0.0  0.3  2666688  55636 s003  S+    3:30PM   0:07.32 python ipengineapp.py --profile-dir /Users/petigura/.ipython/profile_default --log-to-file --log-level=20
> petigura 44593   0.0  0.3  2666688  55676 s003  S+    3:30PM   0:07.19 python ipengineapp.py --profile-dir /Users/petigura/.ipython/profile_default --log-to-file --log-level=20
> petigura 44592   0.0  0.3  2664640  55668 s003  S+    3:30PM   0:07.60 python ipengineapp.py --profile-dir /Users/petigura/.ipython/profile_default --log-to-file --log-level=20
> petigura 44591   0.0  0.3  2665664  55776 s003  S+    3:30PM   0:07.96 python ipengineapp.py --profile-dir /Users/petigura/.ipython/profile_default --log-to-file --log-level=20
> petigura 44590   0.0  0.3  2665664  55680 s003  S+    3:30PM   0:07.72 python ipengineapp.py --profile-dir /Users/petigura/.ipython/profile_default --log-to-file --log-level=20
> petigura 44589   0.0  0.3  2664640  55676 s003  S+    3:30PM   0:08.31 python ipengineapp.py --profile-dir /Users/petigura/.ipython/profile_default --log-to-file --log-level=20
> petigura 44588   0.0  0.2  2635272  39724 s003  S+    3:30PM   0:25.99 python ipcontrollerapp.py --profile-dir /Users/petigura/.ipython/profile_default --log-to-file --log-level=20
> petigura 44587   0.0  0.0  2623100   2844 s003  S+    3:30PM   0:00.01 python ipcontrollerapp.py --profile-dir /Users/petigura/.ipython/profile_default --log-to-file --log-level=20
> petigura 44586   0.0  0.0  2623100   2708 s003  S+    3:30PM   0:00.01 python ipcontrollerapp.py --profile-dir /Users/petigura/.ipython/profile_default --log-to-file --log-level=20
> petigura 44585   0.0  0.0  2614908   2752 s003  S+    3:30PM   0:00.01 python ipcontrollerapp.py --profile-dir /Users/petigura/.ipython/profile_default --log-to-file --log-level=20
> petigura 56024   0.0  0.0  2435544    808 s003  S+   10:26PM   0:00.01 /bin/sh -c python val2139.py > val2139.log
> petigura 56012   0.0  0.0  2435544    808 s003  S+   10:26PM   0:00.01 /bin/sh -c python val2134.py > val2134.log
> petigura 55961   0.0  0.0  2435544    808 s003  S+   10:26PM   0:00.01 /bin/sh -c python val2140.py > val2140.log
> petigura 55811   0.0  0.0  2435544    808 s003  S+   10:24PM   0:00.01 /bin/sh -c python val2135.py > val2135.log
> petigura 53728   0.0  0.0  2666688    428 s003  S+    9:31PM   0:00.00 python ipengineapp.py --profile-dir /Users/petigura/.ipython/profile_default --log-to-file --log-level=20
> petigura 53673   0.0  0.0  2665664    420 s003  S+    9:27PM   0:00.00 python ipengineapp.py --profile-dir /Users/petigura/.ipython/profile_default --log-to-file --log-level=20
> petigura 53670   0.0  0.0  2665664    432 s003  S+    9:27PM   0:00.00 python ipengineapp.py --profile-dir /Users/petigura/.ipython/profile_default --log-to-file --log-level=20
>
>
> Here are some observations:
>
> 1. 8 instances of ipengineapp.py were started when I started my jobs at
> 3:30pm.
> 2. Around 9:30pm, 4 of the cores stopped working and 4 new instances
> of ipengineapp.py were started.
> 3. Now only 4 cores were working.
>
>
> What exactly does the heartbeat do?

The heartbeat is how IPython keeps track of which engines are alive.
ZeroMQ has no notion of disconnects, so detection of peers must be
handled in a separate application channel.  Heartbeats are implemented
as simple (GIL-less) zmq threads on the engines that respond to pings
from the Hub.  When an engine goes down, its heartbeat stops responding
and the engine is treated as dead: work is no longer assigned to it, and
work outstanding on the engine at that time is considered to have failed.
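
To make the bookkeeping concrete, here is a toy model of the Hub side.
It is only a sketch, not IPython's actual HeartMonitor code: the class
and method names are made up for illustration, there is no networking,
and it assumes an engine is dropped after a single missed beat.

```python
class ToyHeartMonitor:
    """Toy model of hub-side heartbeat bookkeeping (hypothetical names)."""

    def __init__(self, engines):
        self.alive = set(engines)   # engines currently considered alive
        self.responded = set()      # engines that answered the current ping

    def pong(self, engine):
        """Record an engine's reply to the current ping."""
        self.responded.add(engine)

    def beat(self):
        """Close the current period: engines that did not reply are dropped."""
        failed = self.alive - self.responded
        self.alive -= failed
        self.responded = set()
        return failed


monitor = ToyHeartMonitor(["engine-0", "engine-1"])

# Period 1: both engines reply in time.
monitor.pong("engine-0")
monitor.pong("engine-1")
first = monitor.beat()

# Period 2: engine-1 is still running but too slow (e.g. heavy system
# load), so its reply does not arrive before the next beat.
monitor.pong("engine-0")
second = monitor.beat()

print(first, second, monitor.alive)
```

In this model a longer period simply gives a slow engine more time to
call pong() before beat() writes it off.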

> Why would an engine work for many hours before dropping out?

If I understand your ps output correctly, the engines that are absent
from the cluster are still running?  This suggests an erroneous heart
failure, due to a bug in the heartbeat system or to general network or
system load.  I've made some recent bugfixes that should address
possible causes here, but the general answer is that if heartbeats are
failing incorrectly, then your system is not keeping up with the
heartbeat's tolerances.  This is alleviated by relaxing the
heartbeat period to 5-10 seconds, as I mentioned above.

I've recently made a few small changes to the heartbeat to fix a
couple of things that could cause incorrect delays in heartbeat
response, so it's possible that if you update to master you will not
see this issue anymore, but I might still recommend increasing the
heartbeat period if you are seeing heart failures.
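
For reference, a sketch of the relevant fragment of
ipcontroller_config.py, assuming the default profile.  One caveat to
verify against your installed version: in the IPython.parallel releases
I am aware of, HeartMonitor.period is documented in milliseconds, in
which case a 10-second period would be written as 10000 rather than 10.

```python
# ipcontroller_config.py, in the profile directory (e.g. profile_default)
c = get_config()

# Interval at which the Hub pings the engines.
# Assumption: the value is in milliseconds (check the option's help text
# in your version), so 10000 would mean 10 seconds.
c.HeartMonitor.period = 10000
```

The controller has to be restarted for the new value to take effect.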

>
> Thanks,
>
> Erik
>
>
>
>

