[IPython-user] How is a TaskClient "fault tolerant"? And can it play nice with PBS queueing?

Brian Granger ellisonbg.net@gmail....
Tue Feb 9 22:32:57 CST 2010


Jon,

I'm acquainting myself with parallel IPython and have a couple of questions.
>
> 1. Could someone please explain what it means that a TaskClient is "fault
> tolerant"?
> http://ipython.scipy.org/doc/stable/html/parallel/parallel_task.html
>
>
Sure.  Basically, if a task fails for any reason (it raises an exception, the
engine dies, etc.), the task will be requeued and attempted again.  You can
set the number of retries with arguments to the task objects:

http://bazaar.launchpad.net/%7Eipython-dev/ipython/trunk/annotate/head%3A/IPython/kernel/task.py#L268

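Roughly, requesting retries looks like this in the current (0.10) API.  This
is an untested sketch; the squared computation and the exact attributes of
the returned TaskResult are just illustrative:

from IPython.kernel import client

tc = client.TaskClient()          # connects to a running controller

# If this task raises or its engine dies, the controller requeues it
# up to 3 more times before marking it as failed.
task = client.StringTask("result = x**2",
                         push=dict(x=4),     # illustrative input
                         pull=['result'],
                         retries=3)

tid = tc.run(task)
tc.barrier([tid])                 # block until the task has finished
tr = tc.get_task_result(tid)      # pulled names should end up in tr.results
print(tr.results)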

> 2. The task interface sounds useful for embarrassingly parallel
> computations.
>

Yes, that is definitely true.


> I'm trying to follow the instructions at
>
> http://ipython.scipy.org/doc/stable/html/parallel/parallel_process.html#using-ipcluster-in-pbs-mode
> (PBS is the queueing system used by the computer cluster I'm working with).
>
> I use the command
> ipcluster pbs -n 8 --pbs-script=pbs.template &
> to run the following pbs script:
>
> #PBS -N ipython
> #PBS -j oe
> #PBS -l walltime=00:10:00
> #PBS -l nodes=${n/8}:ppn=8
> #PBS -q express
> cd $$PBS_O_WORKDIR
> mpiexec -n ${n} ipengine --logfile=$$PBS_O_WORKDIR/ipengine &
> sleep 30
> python ipar.py
>
> ...where ipar.py starts a MultiEngineClient and execute()'s commands that
> use MPI on the ipengines. (I haven't tried using it with a TaskClient yet.)
>

Is the "python ipar.py" in the PBS script?  If so, that is likely creating a
problem...


> Note that I'm starting mpiexec in the background; otherwise, it would never
> finish and my Python script would never get called. Also, I'm backgrounding
> the call to ipcluster because that too never seems to finish. (Using mpiexec
> with "python ipar.py" does not seem to be required.)
>

Ah, I understand your issue.  The ipcluster command starts the controller and
the engines.  The controller is started on the head node and each engine is
started on a compute node.  But the code that uses the MultiEngineClient runs
in only one process, usually on the head node.  The script that uses the
MultiEngineClient is itself run serially; it connects to the controller and
submits tasks, which are then performed in parallel by the engines.
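
In other words, ipar.py itself should stay a small serial driver script.
Roughly, an untested 0.10-style sketch (the mpi4py lines are just an assumed
stand-in for whatever your script actually execute()'s):

from IPython.kernel import client

mec = client.MultiEngineClient()         # serial process, e.g. on the head node
mec.execute('from mpi4py import MPI')    # each execute() runs on every engine
mec.execute('rank = MPI.COMM_WORLD.Get_rank()')
print(mec.pull('rank'))                  # one value per engine, e.g. [0, 1, ...]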

I highly recommend first trying to run this in "local" mode on your
workstation:

ipcluster local -n 2

Then fire up IPython, and create a multiengine client and play with it:

http://ipython.scipy.org/doc/nightly/html/parallel/parallel_multiengine.html
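
Once the two local engines are up, a first session looks something like this
(an illustrative sketch against the 0.10 API):

from IPython.kernel import client

mec = client.MultiEngineClient()
mec.get_ids()              # -> [0, 1] with two local engines
mec.execute('a = 2 + 2')   # runs on every engine
mec.pull('a')              # -> [4, 4], one result per engine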

Let us know how it goes.

Cheers,

Brian

> However, the compute cluster's user instructions say I shouldn't start
> processes in the background, because then they escape the control of the
> job scheduler. Is there a way I can make TaskClient() work under this
> restriction? Otherwise, I'm just going to manually "killall ipcluster"
> etc. once my job is done. (Or maybe that could go as the last lines of my
> pbs script?)
>
> I'm a complete newbie in this, so any hints are highly appreciated.
>
> Best regards,
> Jon Olav Vik
>
>