[IPython-User] Using IPython as a Batch Queue

Erik Petigura eptune@gmail....
Sat Jan 21 17:29:19 CST 2012


Hi, Min.

Thanks for your suggestion regarding the heartbeat and the grep + kill script.  I'll see if I can get some clues as to why the engines are dropping out.  The odd thing is that when I rerun a script that failed, it completes without a problem.

I'll keep you posted!

Erik


On Jan 21, 2012, at 2:27 PM, MinRK wrote:

> On Sat, Jan 21, 2012 at 12:33, Erik Petigura <eptune@gmail.com> wrote:
>> Dear IPython,
>> 
>> I want to execute many embarrassingly parallel processes.  The way I am
>> doing it is the following:
>> 
>> 1. Generate scripts
>> 
>>   $> ls -lth *.py
>>   -rwx------  1 petigura  staff   181B Jan 20 15:08 grid0000.py*
>> 
>>                     <snip>
>> 
>>   -rwx------  1 petigura  staff   184B Jan 20 15:08 grid2730.py*
>> 
>> 2. Run them in a load-balanced way:
>> 
>>   import subprocess
>> 
>>   def srun(s):
>>       """
>>       Run one script with python, redirecting its output to a .log file.
>>       """
>>       log = s.split('.')[0] + '.log'
>>       return subprocess.call('python %s > %s' % (s, log), shell=True)
>> 
>>   view.map(srun, Scripts, block=True)
>> 
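>> For reference, here is roughly how the view and Scripts above are set up
>> (just a sketch, assuming a locally started ipcluster with the default
>> profile):
>> 
>>   from glob import glob
>>   from IPython.parallel import Client
>> 
>>   rc = Client()                       # connect to the running cluster
>>   view = rc.load_balanced_view()      # tasks go to whichever engine is free
>>   Scripts = sorted(glob('grid*.py'))  # the scripts generated in step 1
>> 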
>> I've run into a couple of problems:
>> 
>> Periodically, one of my cores drops out.
> 
> Can you explain this one? Is there any indication as to why one of
> your engines fails?  It's possible this is an erroneous heartbeat
> failure, which can be alleviated by relaxing the heartbeat period to
> 5-10 seconds with:
> 
> c.HeartMonitor.period = 10
> 
> in your ipcontroller_config.py
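> 
> A minimal sketch of that file (the generated config files start with a
> call to get_config()):
> 
> c = get_config()
> # relax the heartbeat so that slow or busy engines are not marked as dead
> c.HeartMonitor.period = 10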
> 
> 
>>  However, when I go back and run it
>> from the shell
>> 
>>    $> python script.py
>> 
>> it completes.  Is there something that could be hanging the view.map?  One
>> of the reasons why I split my jobs up was that if a script fails, subprocess
>> just returns a 1 and presumably view.map would just go on to the next job.
> 
> view.map submits all jobs simultaneously, and an error does not
> prevent later tasks in the map from executing.  The error will be
> raised *locally* in the client, but subsequent tasks continue to run.
> If an engine is going down, then all tasks assigned to that engine
> will fail (1/np tasks during greedy assignment, the default in 0.12
> but no longer in master due to some user confusion).
> 
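> A side note: because srun wraps everything in subprocess.call, a script
> that fails only produces a nonzero return code -- nothing is raised, so
> no error will surface in the client.  A small sketch of checking the
> return codes that map gives back to find the failed scripts:
> 
> results = view.map(srun, scripts, block=True)
> failed = [s for s, rc in zip(scripts, results) if rc != 0]
> print "%i of %i scripts failed" % (len(failed), len(scripts))
> 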
> If you want to protect your tasks from engines shutting down, you can
> add some `retries`, which will resubmit a task a limited number of
> times when it fails before propagating the error up to the client:
> 
> view.retries = 2  # resubmit a failed task up to two times
> amr = view.map(srun, scripts)
> # wait for results:
> amr.get()
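> 
> If you'd rather not set it on the view globally, the same thing can be
> done for a single call with the temp_flags context manager (a quick sketch):
> 
> with view.temp_flags(retries=2):
>     amr = view.map(srun, scripts)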
> 
>> 
>> Also, I have a hard time stopping the cluster.  Doing
>> 
>>    $> ipcluster stop
>> 
>> Doesn't work.
> 
> Can you clarify?  What doesn't work? Is there a traceback? Is there
> any feedback at all, or does it appear to succeed but leave processes
> running? How did you start the engines?
> 
>>  What I've been doing is listing all the ipengines and stopping
>> them with the kill command.
> 
> I've done this many times as well.  In fact, I even have this little
> mess in my environment:
> 
> # `ps | grep` utilities
> # list processes matching a pattern, filtering out the grep itself
> psgrep(){
>    ps aux | grep -e "$@" | grep -v "grep -e $@"
> }
> # echo the matches, then send SIGTERM to each matching pid
> psgrepkillall(){
>    echo $(psgrep $@)
>    psgrep $@ | awk '{ print $2 }' | sed "s@^@kill -TERM @" | sh
> }
> alias psg="psgrep"
> alias pskill="psgrepkillall"
> 
> so I can do `pskill ipengine` to terminate all engines.
> 
> -MinRK
> 
>> 
>> Thanks in advance for help/advice!
>> 
>> Erik
>> 
>> _______________________________________________
>> IPython-User mailing list
>> IPython-User@scipy.org
>> http://mail.scipy.org/mailman/listinfo/ipython-user
>> 


