[IPython-User] Using IPython as a Batch Queue
MinRK
benjaminrk@gmail....
Mon Jan 23 18:53:11 CST 2012
On Sun, Jan 22, 2012 at 22:53, Erik Petigura <eptune@gmail.com> wrote:
> Dear Wes and Min,
>
> Thanks for the suggestions regarding other programs for managing batch
> submission. If it's okay, I'd like to understand a bit more what's going on
> in the IPython framework.
>
> Periodically, one of my cores drops out.
>
> Can you explain this one? Is there any indication as to why one of
> your engines fails? It's possible this is an erroneous heart failure,
> which can be alleviated by relaxing the heartbeat period to 5-10
> seconds with:
>
> c.HeartMonitor.period = 10
>
> in your ipcontroller_config.py
>
> Here is a `ps aux' dump of what's going on. I cleaned up the paths for
> readability.
>
>
>
> USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME
> COMMAND
> petigura 56013 100.0 0.3 2634324 53768 s003 R+ 10:26PM 0:35.62
> python val2134.py
> petigura 55962 99.0 0.3 2652864 55496 s003 R+ 10:26PM 1:13.14
> python val2140.py
> petigura 56025 99.0 0.3 2635648 53692 s003 R+ 10:26PM 0:28.09
> python val2139.py
> petigura 55812 98.5 0.3 2653816 62736 s003 R+ 10:24PM 2:36.85
> python val2135.py
> petigura 38665 22.6 0.5 2699096 99376 s002 R+ 12:17PM 82:11.48
> python ipython --pylab
> petigura 44579 0.3 0.2 2559724 33472 s003 S+ 3:30PM 2:15.77
> python ipcluster start --n=8
> petigura 44584 0.1 0.3 2643632 61900 s003 S+ 3:30PM 1:07.71
> python ipcontrollerapp.py --profile-dir
> /Users/petigura/.ipython/profile_default --log-to-file --log-level=20
> petigura 53491 0.0 0.0 2666688 432 s003 S+ 9:17PM 0:00.00
> python ipengineapp.py --profile-dir /Users/petigura/.ipython/profile_default
> --log-to-file --log-level=20
> petigura 44596 0.0 0.3 2666688 55640 s003 S+ 3:30PM 0:06.63
> python ipengineapp.py --profile-dir /Users/petigura/.ipython/profile_default
> --log-to-file --log-level=20
> petigura 44595 0.0 0.3 2665664 55680 s003 S+ 3:30PM 0:06.88
> python ipengineapp.py --profile-dir /Users/petigura/.ipython/profile_default
> --log-to-file --log-level=20
> petigura 44594 0.0 0.3 2666688 55636 s003 S+ 3:30PM 0:07.32
> python ipengineapp.py --profile-dir /Users/petigura/.ipython/profile_default
> --log-to-file --log-level=20
> petigura 44593 0.0 0.3 2666688 55676 s003 S+ 3:30PM 0:07.19
> python ipengineapp.py --profile-dir /Users/petigura/.ipython/profile_default
> --log-to-file --log-level=20
> petigura 44592 0.0 0.3 2664640 55668 s003 S+ 3:30PM 0:07.60
> python ipengineapp.py --profile-dir /Users/petigura/.ipython/profile_default
> --log-to-file --log-level=20
> petigura 44591 0.0 0.3 2665664 55776 s003 S+ 3:30PM 0:07.96
> python ipengineapp.py --profile-dir /Users/petigura/.ipython/profile_default
> --log-to-file --log-level=20
> petigura 44590 0.0 0.3 2665664 55680 s003 S+ 3:30PM 0:07.72
> python ipengineapp.py --profile-dir /Users/petigura/.ipython/profile_default
> --log-to-file --log-level=20
> petigura 44589 0.0 0.3 2664640 55676 s003 S+ 3:30PM 0:08.31
> python ipengineapp.py --profile-dir /Users/petigura/.ipython/profile_default
> --log-to-file --log-level=20
> petigura 44588 0.0 0.2 2635272 39724 s003 S+ 3:30PM 0:25.99
> python ipcontrollerapp.py --profile-dir
> /Users/petigura/.ipython/profile_default --log-to-file --log-level=20
> petigura 44587 0.0 0.0 2623100 2844 s003 S+ 3:30PM 0:00.01
> python ipcontrollerapp.py --profile-dir
> /Users/petigura/.ipython/profile_default --log-to-file --log-level=20
> petigura 44586 0.0 0.0 2623100 2708 s003 S+ 3:30PM 0:00.01
> python ipcontrollerapp.py --profile-dir
> /Users/petigura/.ipython/profile_default --log-to-file --log-level=20
> petigura 44585 0.0 0.0 2614908 2752 s003 S+ 3:30PM 0:00.01
> python ipcontrollerapp.py --profile-dir
> /Users/petigura/.ipython/profile_default --log-to-file --log-level=20
> petigura 56024 0.0 0.0 2435544 808 s003 S+ 10:26PM 0:00.01
> /bin/sh -c python val2139.py > val2139.log
> petigura 56012 0.0 0.0 2435544 808 s003 S+ 10:26PM 0:00.01
> /bin/sh -c python val2134.py > val2134.log
> petigura 55961 0.0 0.0 2435544 808 s003 S+ 10:26PM 0:00.01
> /bin/sh -c python val2140.py > val2140.log
> petigura 55811 0.0 0.0 2435544 808 s003 S+ 10:24PM 0:00.01
> /bin/sh -c python val2135.py > val2135.log
> petigura 53728 0.0 0.0 2666688 428 s003 S+ 9:31PM 0:00.00
> python ipengineapp.py --profile-dir /Users/petigura/.ipython/profile_default
> --log-to-file --log-level=20
> petigura 53673 0.0 0.0 2665664 420 s003 S+ 9:27PM 0:00.00
> python ipengineapp.py --profile-dir /Users/petigura/.ipython/profile_default
> --log-to-file --log-level=20
> petigura 53670 0.0 0.0 2665664 432 s003 S+ 9:27PM 0:00.00
> python ipengineapp.py --profile-dir /Users/petigura/.ipython/profile_default
> --log-to-file --log-level=20
>
>
> Here are some observations:
>
> 1. 8 instances of ipengineapp.py were started when I started my jobs at
> 3:30pm.
> 2. Around 9:30pm, 4 of the cores stopped working and 4 new instances
> of ipengineapp.py were started.
> 3. Now only 4 cores were working.
>
>
> What exactly does the heartbeat do?
The heartbeat is how IPython keeps track of which engines are alive.
ZeroMQ has no notion of disconnects, so detecting lost peers must be
handled in a separate application-level channel. The heartbeats are
simple (GIL-less) zmq threads on the engines that respond to pings
from the Hub. When an engine goes down, its heartbeat stops responding
and the engine is treated as dead: work is no longer assigned to it,
and work outstanding on the engine at that time is considered to have
failed.
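To make the ping/echo idea concrete, here is a toy model in plain
Python. It is only a sketch and not IPython's implementation: the
class, the helper names, and the rule that a single missed reply counts
as a failure are all assumptions made for illustration, and the real
monitor's tolerance may differ. It shows how an engine that is still
running, but misses a ping, ends up dropped from the cluster.

```python
class HeartMonitor:
    """Toy hub-side heart monitor (hypothetical, not IPython's class)."""

    def __init__(self, engines):
        # engines maps an engine id to a callable that takes a ping and
        # returns the echoed ping, or None if no reply arrived in time.
        self.engines = dict(engines)
        self.alive = set(self.engines)
        self.failed = []
        self.tick = 0

    def beat(self):
        """Send one ping; drop every live engine that fails to echo it."""
        self.tick += 1
        ping = str(self.tick)
        for eid in sorted(self.alive):  # sorted() copies, so discarding is safe
            if self.engines[eid](ping) != ping:
                self.alive.discard(eid)
                self.failed.append(eid)
        return set(self.alive)


def healthy(ping):
    """A responsive engine echoes every ping."""
    return ping


def make_slow_engine(stall_on):
    """An engine that is running but misses ping number `stall_on`."""
    def heart(ping):
        return None if ping == str(stall_on) else ping
    return heart


monitor = HeartMonitor({0: healthy, 1: healthy, 2: make_slow_engine(3)})
history = [sorted(monitor.beat()) for _ in range(4)]
print(history)         # [[0, 1, 2], [0, 1, 2], [0, 1], [0, 1]]
print(monitor.failed)  # [2]
```

Engine 2 never exits in this model. It simply fails to answer the third
ping, and from then on the monitor no longer counts it as part of the
cluster, which is the kind of erroneous heart failure a longer period
is meant to avoid.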
> Why would an engine work for many hours before dropping out?
If I understand your ps output correctly, the engines that are absent
from the cluster are still running? This suggests an erroneous heart
failure, due to a bug in the heartbeat system or to general network or
system load. The general answer is that if heartbeats are failing
incorrectly, then your system is not keeping up with the heartbeat's
tolerances. This is alleviated by relaxing the heartbeat period to
5-10 seconds, as I mentioned above.
I've recently made a few small changes to the heartbeat to fix a
couple of things that could cause incorrect delays in heartbeat
response, so it's possible that if you update to master you will not
see this issue anymore, but I would still recommend increasing the
heartbeat period if you are seeing heart failures.
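As a sketch of where that setting goes, the fragment below would sit in
ipcontroller_config.py in the profile directory (profile_default in
your ps output). One caveat, which is an assumption worth checking
against your installed version: the help text for HeartMonitor.period
in the releases I know of gives the unit as milliseconds, in which
case a 10 second period is written as 10000 rather than 10. Running
`ipcontroller --help-all` shows the trait's help text and default.

```python
# ipcontroller_config.py (configuration fragment, not a standalone script)
c = get_config()

# Assumption: the period is in milliseconds, so this asks the Hub to
# ping the engines every 10 seconds. Confirm the unit for your version.
c.HeartMonitor.period = 10000
```

The controller reads this file at startup, so the cluster has to be
restarted for the change to take effect.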
>
> Thanks,
>
> Erik
>
>
>
>