[IPython-user] How is a TaskClient " fault tolerant" ? And can it play nice with PBS queueing?

Jon Olav Vik jonovik@gmail....
Wed Feb 10 01:45:09 CST 2010


Brian Granger <ellisonbg.net <at> gmail.com> writes:

> > I'm acquainting myself with parallel IPython and have a couple of questions.
> > 1. Could someone please explain what it means that a TaskClient is "fault
> > tolerant"?
> > http://ipython.scipy.org/doc/stable/html/parallel/parallel_task.html
> 
> Sure.  Basically if a task fails for any reason (raises an exception, the
> engine dies, etc.), the task will be requeued and attempted again.  You can
> set the number of retries with arguments to the task objects:
> http://bazaar.launchpad.net/%7Eipython-dev/ipython/trunk/annotate/head%3A/IPython/kernel/task.py#L268
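To make sure I understand the requeue behaviour described above, here is how I picture it in plain Python. This is purely my own sketch of the idea, not the actual IPython scheduler code (the real retry handling lives in the task.py linked above):

```python
# Sketch of "fault tolerant" task handling: a task that raises is put
# back on the queue and attempted again, up to a retry budget.
# This is an illustration of the concept, not IPython's implementation.
from collections import deque

def run_with_requeue(tasks, max_retries=2):
    """Run (name, func) tasks, requeueing failures up to max_retries times."""
    queue = deque((name, func, 0) for name, func in tasks)
    results, failures = {}, {}
    while queue:
        name, func, attempts = queue.popleft()
        try:
            results[name] = func()
        except Exception as exc:
            if attempts < max_retries:
                # Requeue: the task goes to the back and is tried again.
                queue.append((name, func, attempts + 1))
            else:
                failures[name] = exc
    return results, failures
```

If that picture is right, a task whose engine dies mid-run simply reappears in the queue and lands on another engine.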

Thank you for the very enlightening answers!

I have made a "timeout" decorator based on signal.alarm() to prevent the 
processing of a task from taking forever. But perhaps there is already some 
timeout facility built in?
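For reference, my decorator is roughly along these lines (Unix-only, since signal.alarm() is not available on Windows; the TaskTimeout exception name is just my own choice):

```python
# Rough sketch of a timeout decorator built on signal.alarm() (Unix-only).
import signal
from functools import wraps

class TaskTimeout(Exception):
    """Raised when a wrapped call exceeds its time budget."""

def timeout(seconds):
    """Abort the wrapped call with TaskTimeout after `seconds` seconds."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            def handler(signum, frame):
                raise TaskTimeout("call took longer than %d s" % seconds)
            old = signal.signal(signal.SIGALRM, handler)
            signal.alarm(seconds)           # schedule SIGALRM
            try:
                return func(*args, **kwargs)
            finally:
                signal.alarm(0)             # cancel any pending alarm
                signal.signal(signal.SIGALRM, old)
        return wrapper
    return decorator
```

One caveat I am aware of: only one SIGALRM can be pending per process, so this cannot be nested.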

> > I'm trying to follow the instructions at
> > http://ipython.scipy.org/doc/stable/html/parallel/parallel_process.html#using-ipcluster-in-pbs-mode
> > ipcluster pbs -n 8 --pbs-script=pbs.template &
> > to run the following pbs script:
> > #PBS -N ipython
> > #PBS -j oe
> > #PBS -l walltime=00:10:00
> > #PBS -l nodes=${n/8}:ppn=8
> > #PBS -q express
> > cd $$PBS_O_WORKDIR
> > mpiexec -n ${n} ipengine --logfile=$$PBS_O_WORKDIR/ipengine &
> > sleep 30
> > python ipar.py
> > ...where ipar.py starts a MultiEngineClient and execute()'s commands that use
> > MPI on the ipengines. (I haven't tried using it with a TaskClient yet.)
> 
> Is the "python ipar.py" in the PBS script?  If so, that is likely creating a
> problem...

You suggested I start IPython interactively and play with the parallel 
facilities. I have, and it was great fun! However, I'm not sure how to transfer 
what I learned to a batch script. The "python ipar.py" following "mpiexec 
ipengine" was the best I could think of. My idea was that the ipar.py script 
would sort of take the place of the interactive session that you mentioned 
below.

> > Note that I'm starting mpiexec in the background; otherwise, it would never
> > finish and my Python script would never get called. Also, I'm backgrounding
> > the call to ipcluster because that too never seems to finish. (Using mpiexec
> > with "python ipar.py" does not seem to be required.)
> 
> Ah, I understand your issue.  The command ipcluster starts the engines and
> controller.  The controller is started on the head node and each engine is
> started on a compute node.  But, the code that uses the MultiEngineClient only
> runs in 1 process, usually on the head node.  The script that uses
> MultiEngineClient itself is only run in serial.  But, it connects to the
> controller and submits tasks which are then performed in parallel by the
> engines.  I highly recommend first trying to run this in "local" mode on your
> workstation:
> 
> ipcluster local -n 2
> 
> Then fire up IPython, and create a multiengine client and play with it:
> http://ipython.scipy.org/doc/nightly/html/parallel/parallel_multiengine.html
> Let us know how it goes.

Thanks for the explanation; I certainly need several attempts to wrap my head 
around this 8-)

So, how would this be for a batch script running under PBS? In the jargon of 
PBS, a "node" is a machine with e.g. eight "processors". Would the "head node" 
that you mention correspond to "rank 0", the root processor? And so the 
"compute nodes" would be all the other processors?

If so, do I understand correctly that:

Rank 0 (the root process) is the "head node" and runs the controller.
Ranks 1 <= i < MPI.Get_size() are "compute nodes" and run one engine each.
In addition, the main script like my ipar.py, which is used "instead of" an 
interactive session, runs as another process on rank 0?
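To check my understanding, here is that layout written out as a tiny Python sketch. This is purely my own assumption about which rank hosts what, not anything IPython itself computes:

```python
# My guessed mapping from MPI rank to hosted processes (an assumption,
# not IPython code): rank 0 hosts the controller plus the client script,
# every other rank hosts one engine.
def cluster_layout(size):
    """Return {rank: [process names]} for a job of `size` MPI ranks."""
    layout = {0: ["ipcontroller", "python ipar.py"]}
    for rank in range(1, size):
        layout[rank] = ["ipengine"]
    return layout
```

So for n=4 I would expect rank 0 to carry the controller and ipar.py, and ranks 1-3 one engine each.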

How should I do this with PBS and mpiexec? From
http://linux.die.net/man/1/mpiexec it seems that "mpiexec -np X programname"
will start X instances of "programname", by default starting on processor 0, 1,
... in a round-robin fashion. If so, would this do the trick?
 
mpiexec -np 1 ipcontroller
mpiexec -np $((n-1)) ipengine
mpiexec -np 1 python ipar.py

I'd hope that this starts the ipcontroller on rank 0, ipengines on 1, 2, 3, and 
ipar.py on rank 0 again (for n=4). Nothing would be backgrounded (by me, 
anyway), so the job system should have nothing to complain about. (Would 
"ipcluster -$((n-1))" be an alternative to the ipcontroller and ipengine 
commands?)

Thank you very much for your help.

Jon Olav

> However, the compute cluster's user instructions say I shouldn't start
> processes in the background, because then they escape the control of the job
> scheduler. Is there a way I can make TaskClient() work under this restriction?
> Otherwise, I'm just going to manually "killall ipcluster" etc. once my job is
> done. (Or maybe that could go as the last lines of my pbs script?)



