[IPython-User] Using IPython as a Batch Queue

MinRK benjaminrk@gmail....
Sat Jan 21 16:27:29 CST 2012


On Sat, Jan 21, 2012 at 12:33, Erik Petigura <eptune@gmail.com> wrote:
> Dear IPython,
>
> I want to execute many embarrassingly parallel processes.  The way I am
> doing it is the following:
>
> 1. Generate scripts
>
>   $> ls -lth *.py
>   -rwx------  1 petigura  staff   181B Jan 20 15:08 grid0000.py*
>
>                     <snip>
>
>   -rwx------  1 petigura  staff   184B Jan 20 15:08 grid2730.py*
>
> 2. Run them in a load balanced way in the following manner.
>
>   import subprocess
>
>   def srun(s):
>       """
>       Convert a script into a python call + log file.
>       """
>       log = s.split('.')[0] + '.log'
>       return subprocess.call('python %s > %s' % (s, log), shell=True)
>
>   view.map(srun,Scripts,block=True)
>
> I've run into a couple of problems:
>
> Periodically, one of my cores drops out.

Can you explain this one? Is there any indication as to why one of
your engines fails?  It's possible this is a spurious heartbeat
failure (the controller wrongly declaring a busy engine dead), which
can be alleviated by relaxing the heartbeat period to 5-10 seconds
with:

c.HeartMonitor.period = 10

in your ipcontroller_config.py
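
For reference, a minimal sketch of that file (the profile path below
is an assumption; use whatever profile your controller runs with):

# e.g. ~/.ipython/profile_default/ipcontroller_config.py
c = get_config()

# relax the heartbeat to 10 seconds, so a busy engine is not
# mistakenly declared dead
c.HeartMonitor.period = 10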


>  However, when I go back and run it
> from the shell
>
>    $> python script.py
>
> it completes.  Is there something that could be hanging the view.map?  One
> of the reasons why I split my jobs up was so that if a script fails,
> subprocess just returns a 1 and presumably view.map would just go on to the
> next job.

view.map submits all jobs up front, and an error in one does not
prevent later tasks in the map from executing.  The error is
raised *locally* in the client, but subsequent tasks continue to run.
If an engine goes down, then every task assigned to that engine
will fail (1/N of the tasks under greedy assignment, where N is the
number of engines; greedy assignment is the default in 0.12, but no
longer in master due to some user confusion).
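
If you'd rather handle failures yourself instead of letting the map
raise, here's a rough sketch (untested) using apply_async to submit
one task per script and catch errors per result:

# one AsyncResult per script; a failure in one doesn't affect the rest
async_results = [view.apply_async(srun, s) for s in Scripts]

exit_codes = []
for s, ar in zip(Scripts, async_results):
    try:
        exit_codes.append(ar.get())   # blocks until this task finishes
    except Exception as e:            # e.g. the engine died mid-task
        print("%s failed: %r" % (s, e))
        exit_codes.append(None)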

If you want to protect your tasks from engines shutting down, you can
add some `retries`, which will resubmit a task a limited number of
times when it fails before propagating the error up to the client:

view.retries = 2  # resubmit a failed task up to two times
amr = view.map(srun, Scripts)
# wait for results:
amr.get()
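
For context, the whole pattern looks something like this (a sketch;
the connection details are whatever you already use for your client):

from IPython.parallel import Client

rc = Client()                    # connect to the running controller
view = rc.load_balanced_view()   # load-balanced view over all engines
view.retries = 2                 # resubmit a failed task up to twice

amr = view.map(srun, Scripts, block=False)
amr.get()                        # wait for results (raises if a task
                                 # still fails after its retries)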

>
> Also, I have a hard time stopping the cluster.  Doing
>
>    $> ipcluster stop
>
> Doesn't work.

Can you clarify?  What doesn't work? Is there a traceback? Is there
any feedback at all, or does it appear to succeed but leave processes
running? How did you start the engines?

>  What I've been doing is listing all the ipengines and stopping
> them with the kill command.

I've done this many times as well.  In fact, I even have this little
mess in my environment:

# `ps | grep` utilities
psgrep(){
    ps aux | grep -e "$@" | grep -v "grep -e $@"
}
psgrepkillall(){
    echo $(psgrep $@)
    psgrep $@ | awk '{ print $2 }' | sed "s@^@kill -TERM @" | sh
}
alias psg="psgrep"
alias pskill="psgrepkillall"

so I can do `pskill ipengine` to terminate all engines.

-MinRK

>
> Thanks in advance for help/advice!
>
> Erik
>

