[IPython-user] How is a TaskClient "fault tolerant"? And can it play nice with PBS queueing?

Jon Olav Vik jonovik@gmail....
Wed Feb 10 08:56:55 CST 2010


Jon Olav Vik <jonovik <at> gmail.com> writes:
> How should I do this with PBS and mpiexec? From
> http://linux.die.net/man/1/mpiexec it seems that mpiexec -np X programname
> will start X instances of "programname", by default starting on processor
> 0, 1, ... in a round-robin fashion. If so, would this do the trick?
> 
> mpiexec -np 1 ipcontroller
> mpiexec -np $((n-1)) ipengine
> mpiexec -np 1 python ipar.py
> 
> I'd hope that this starts the ipcontroller on rank 0, ipengines on 1, 2, 3,
> and ipar.py on rank 0 again (for n=4). Nothing would be backgrounded (by me,
> anyway), so the job system should have nothing to complain about. (Would 
> "ipcluster -$((n-1))" be an alternative to the ipcontroller and ipengine 
> commands?)

I tried the following, which works outside the batch system (running on the 
cluster's login node), but not in my attempt at a PBS batch job.

Outside the batch system, I run
ipcluster local -n 4 &
sleep 10
python ipar.py

== ipar.py ==
from IPython.kernel import client
mec = client.MultiEngineClient()
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
name = MPI.Get_processor_name()
print "I am rank %s of %s on %s" % (rank, size, name)
@mec.parallel()
def f(x):
    return -x
print f(range(11))
mec.kill()

and the output is:

[jonvi@stallo-1 restitution]$ ipcluster local -n 4 &
[jonvi@stallo-1 restitution]$ sleep 10
2010-02-10 13:19:24+0100 [-] Log opened.
2010-02-10 13:19:24+0100 [-] Process ['ipcontroller', '--logfile=/home/jonvi/.ipython/log/ipcontroller'] has started with pid=4546
2010-02-10 13:19:24+0100 [-] Waiting for controller to finish starting...
2010-02-10 13:19:27+0100 [-] Controller started
... Process ['ipengine', ...] has started with pid=4550...
2010-02-10 13:19:28+0100 [-] Engines started with pids: [4550, 4552, 4553, 4554]

I am rank 0 of 1 on stallo-1.local
[0, -1, -2, -3, -4, -5, -6, -7, -8, -9, -10]

2010-02-10 13:19:35+0100 [-] Process ['ipengine'...has stopped with 0
...
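
For readers unfamiliar with the decorator in ipar.py, the following is a minimal pure-Python sketch of the scatter/map/gather pattern that `@mec.parallel()` is understood to apply to `f(range(11))`. It does not use IPython at all. The helper names `scatter` and `parallel_map` are hypothetical, and the block-wise split is an assumption for illustration; the real engines may partition the sequence differently.

```python
def scatter(seq, n_engines):
    """Split seq into n_engines contiguous blocks whose sizes differ by at most one."""
    seq = list(seq)
    base, extra = divmod(len(seq), n_engines)
    chunks, start = [], 0
    for i in range(n_engines):
        # The first `extra` blocks get one additional element.
        stop = start + base + (1 if i < extra else 0)
        chunks.append(seq[start:stop])
        start = stop
    return chunks


def parallel_map(func, seq, n_engines=4):
    """Apply func to each block (standing in for one engine each), then gather in order."""
    chunks = scatter(seq, n_engines)
    partial_results = [[func(x) for x in chunk] for chunk in chunks]
    return [y for part in partial_results for y in part]


def f(x):
    return -x


result = parallel_map(f, range(11))
print(result)
```

With four simulated engines the gathered list has the same order as the input, which is why the output above reads as a single negated sequence regardless of how many engines took part.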

However, when I submit this script:

$ qsub ipar.sh

== ipar.sh ==
#!/bin/bash
# Based on http://docs.notur.no/uit/files-uit/runscript-example-stallo.sh/view
#PBS -lnodes=2:ppn=8 
#PBS -lwalltime=0:02:00
cd $PBS_O_WORKDIR
mpiexec -np 1 ipcontroller
sleep 30
mpiexec -np 15 ipengine --logfile=$PBS_O_WORKDIR/ipengine
sleep 30
mpiexec -np 1 python ipar.py
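
The fixed `sleep 30` calls in ipar.sh guess how long startup takes. A possible alternative, sketched here as an assumption rather than a tested recipe, is to have the client script poll for the controller's connection file before creating the client. The path is the `ipcontroller-mec.furl` location that the controller log in this message reports; `wait_for_file` is a hypothetical helper, not part of IPython.

```python
import os
import time


def wait_for_file(path, timeout=60.0, poll_interval=0.5):
    """Return True once path exists, or False if timeout seconds pass first."""
    deadline = time.time() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.time() >= deadline:
            return False
        time.sleep(poll_interval)


# Location taken from the controller log; adjust if the controller is configured otherwise.
FURL_PATH = os.path.expanduser("~/.ipython/security/ipcontroller-mec.furl")

# Intended use at the top of a client script such as ipar.py:
#     if not wait_for_file(FURL_PATH, timeout=120):
#         raise SystemExit("controller connection file never appeared")
```

A file left over from an earlier run would satisfy this check immediately, so removing stale furl files before starting the controller may be needed for the wait to mean anything. The check also does nothing if the command that starts the controller never returns control to the script.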

The ipar.py never seems to execute, and the batch job gives the following 
output:
2010-02-10 13:09:49+0100 [-] Log opened.
2010-02-10 13:09:49+0100 [-] foolscap.pb.Listener starting on 44962
2010-02-10 13:09:49+0100 [-] foolscap.pb.Listener starting on 60817
2010-02-10 13:09:49+0100 [-] Adapting Controller to interface: multiengine
2010-02-10 13:09:49+0100 [-] Saving furl for interface [multiengine] to file: /home/jonvi/.ipython/security/ipcontroller-mec.furl
2010-02-10 13:09:49+0100 [-] Adapting Controller to interface: task
2010-02-10 13:09:49+0100 [-] Saving furl for interface [task] to file: /home/jonvi/.ipython/security/ipcontroller-tc.furl
2010-02-10 13:09:49+0100 [-] Saving furl for the engine to file: /home/jonvi/.ipython/security/ipcontroller-engine.furl
2010-02-10 13:09:49+0100 [-] twisted.internet.protocol.DatagramProtocol starting on 48454
2010-02-10 13:09:49+0100 [-] Starting protocol <twisted.internet.protocol.DatagramProtocol instance at 0x20db3518>
2010-02-10 13:09:49+0100 [-] twisted.internet.protocol.DatagramProtocol starting on 48790
2010-02-10 13:09:49+0100 [-] Starting protocol <twisted.internet.protocol.DatagramProtocol instance at 0x20f52560>
2010-02-10 13:09:49+0100 [-] (Port 48454 Closed)
2010-02-10 13:09:49+0100 [-] Stopping protocol <twisted.internet.protocol.DatagramProtocol instance at 0x20db3518>
2010-02-10 13:09:49+0100 [-] (Port 48790 Closed)
2010-02-10 13:09:49+0100 [-] Stopping protocol <twisted.internet.protocol.DatagramProtocol instance at 0x20f52560>
2010-02-10 13:11:43+0100 [-] Received SIGTERM, shutting down.
...
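
One possible reading of this log, offered as an assumption and not a confirmed diagnosis: only controller messages appear, and the SIGTERM arrives roughly two minutes later, close to the requested walltime of 0:02:00. That would be consistent with `mpiexec -np 1 ipcontroller` not returning, since a controller is a long-running server, so the later lines of ipar.sh would never be reached. The sketch below illustrates the general difference between waiting on a foreground server and starting it without waiting. It uses stand-in commands (short Python one-liners), not the real ipcontroller, ipengine, or mpiexec.

```python
import subprocess
import sys
import time

# Stand-in for a long-running server process (hypothetical command).
SERVER_CMD = [sys.executable, "-c", "import time; time.sleep(30)"]
# Stand-in for the client script that should run once the server is up.
CLIENT_CMD = [sys.executable, "-c", "print('client ran')"]


def foreground_server_blocks(timeout=0.5):
    """Run the server in the foreground; True means it was still running at timeout."""
    try:
        subprocess.run(SERVER_CMD, timeout=timeout)
    except subprocess.TimeoutExpired:
        return True  # a script waiting on this step would not have moved on yet
    return False


def run_with_background_server(startup_delay=0.2):
    """Start the server without waiting, run the client, then stop the server."""
    server = subprocess.Popen(SERVER_CMD)  # returns immediately
    try:
        time.sleep(startup_delay)
        done = subprocess.run(CLIENT_CMD, capture_output=True, text=True, check=True)
        return done.stdout.strip()
    finally:
        server.terminate()
        server.wait()
```

Whether a given batch system tolerates processes started this way inside a job is a separate question that this sketch does not answer.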

Any further suggestions are welcome!

Jon Olav



