[IPython-User] Using IPython as a Batch Queue
Erik Petigura
eptune@gmail....
Sat Jan 21 17:29:19 CST 2012
Hi, Min.
Thanks for your suggestion regarding the heartbeat and the grep + kill script. I'll see if I can get some clues as to why the engines are dropping out. The difficulty is that when I rerun a script that failed, it completes without a problem.
I'll keep you posted!
Erik
On Jan 21, 2012, at 2:27 PM, MinRK wrote:
> On Sat, Jan 21, 2012 at 12:33, Erik Petigura <eptune@gmail.com> wrote:
>> Dear IPython,
>>
>> I want to execute many embarrassingly parallel processes. The way I am
>> doing it is the following:
>>
>> 1. Generate scripts
>>
>> $> ls -lth *.py
>> -rwx------ 1 petigura staff 181B Jan 20 15:08 grid0000.py*
>>
>> <snip>
>>
>> -rwx------ 1 petigura staff 184B Jan 20 15:08 grid2730.py*
>>
>> 2. Run them in a load balanced way in the following manner.
>>
>> def srun(s):
>>     """
>>     Convert a script to a python call + log
>>     """
>>     log = s.split('.')[0] + '.log'
>>     return subprocess.call('python %s > %s' % (s, log), shell=True)
>>
>> view.map(srun,Scripts,block=True)
>>
>> I've run into a couple of problems:
>>
>> Periodically, one of my cores drops out.
>
> Can you explain this one? Is there any indication as to why one of
> your engines fails? It's possible this is an erroneous heart failure,
> which can be alleviated by relaxing the heartbeat period to 5-10
> seconds with:
>
> c.HeartMonitor.period = 10000  # the period is given in milliseconds
>
> in your ipcontroller_config.py
>
>
>> However, when I go back and run it
>> from the shell
>>
>> $> python script.py
>>
>> it completes. Is there something that could be hanging the view.map? One
>> of the reasons why I split my jobs up was that if a script fails, subprocess
>> just passes a 1 and presumably view.map would just go on to the next job.
>
> view.map submits all jobs simultaneously, and an error does not
> prevent later tasks in the map from executing. The error will be
> raised *locally* in the client, but subsequent tasks continue to run.
> If an engine is going down, then all tasks assigned to that engine
> will fail (1/np tasks during greedy assignment, the default in 0.12
> but no longer in master due to some user confusion).
>
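Because `srun` uses `subprocess.call`, a script that exits with a non-zero status does not raise; the status simply comes back as that task's result. A minimal sketch of picking out the failed scripts after the map completes, assuming the results arrive as a list in the same order as the scripts. The helper name `failed_scripts` and the sample values are hypothetical.

```python
def failed_scripts(scripts, return_codes):
    """Pair each script with its exit status and keep the non-zero ones."""
    return [(s, rc) for s, rc in zip(scripts, return_codes) if rc != 0]


# Hypothetical values standing in for the list returned by
# view.map(srun, scripts, block=True).
scripts = ['grid0000.py', 'grid0001.py', 'grid0002.py']
return_codes = [0, 1, 0]

print(failed_scripts(scripts, return_codes))  # [('grid0001.py', 1)]
```

Only the scripts in that list would then need to be resubmitted.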
> If you want to protect your tasks from engines shutting down, you can
> add some `retries`, which will resubmit a task a limited number of
> times when it fails before propagating the error up to the client:
>
> view.retries = 2 # retry task after up to two failures
> amr = view.map(srun, scripts)
> # wait for results:
> amr.get()
>
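Note that `subprocess.call` returns the exit status rather than raising, so a script that exits with 1 still looks like a completed task. A minimal sketch of a variant that raises instead, on the assumption that a raised exception marks the task as failed and so makes it eligible for a retry. The name `srun_checked` is hypothetical, and the imports sit inside the function because it runs on the engines, which may not have imported those modules.

```python
def srun_checked(script):
    """Run one script, log its output, and raise if it exits non-zero."""
    # Imported here so the names exist wherever the function executes.
    import subprocess
    import sys

    log = script.rsplit('.', 1)[0] + '.log'
    with open(log, 'w') as fh:
        # check_call raises CalledProcessError on a non-zero exit status.
        subprocess.check_call([sys.executable, script],
                              stdout=fh, stderr=subprocess.STDOUT)
    return log
```

It would be passed to `view.map` in place of `srun`. Passing the command as a list also avoids `shell=True`, so script names with spaces need no quoting.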
>>
>> Also, I have a hard time stopping the cluster. Doing
>>
>> $> ipcluster stop
>>
>> Doesn't work.
>
> Can you clarify? What doesn't work? Is there a traceback? Is there
> any feedback at all, or does it appear to succeed but leaves processes
> running? How did you start the engines?
>
>> What I've been doing is listing all the ipengines and stopping
>> them with the kill command.
>
> I've done this many times as well. In fact, I even have this little
> mess in my environment:
>
> # `ps | grep` utilities
> psgrep(){
>     ps aux | grep -e "$@" | grep -v "grep -e $@"
> }
> psgrepkillall(){
>     echo $(psgrep $@)
>     psgrep $@ | awk '{ print $2 }' | sed "s@^@kill -TERM @" | sh
> }
> alias psg="psgrep"
> alias pskill="psgrepkillall"
>
> so I can do `pskill ipengine` to terminate all engines.
>
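The same idea as a Python sketch, for cases where the shell functions are not available. It assumes `ps aux` prints the usual 11 columns, with the PID second and the command last; `find_pids` and `pskill` are hypothetical helper names, not part of IPython.

```python
import os
import signal
import subprocess


def find_pids(ps_output, pattern):
    """Return the PIDs of `ps aux` lines whose command contains pattern."""
    pids = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 10)        # 11th field is the full command
        if len(fields) == 11 and pattern in fields[10]:
            pids.append(int(fields[1]))
    return pids


def pskill(pattern):
    """Send SIGTERM to every process matching pattern, except this one."""
    out = subprocess.check_output(['ps', 'aux']).decode(errors='replace')
    for pid in find_pids(out, pattern):
        if pid != os.getpid():
            os.kill(pid, signal.SIGTERM)
```

With this, `pskill('ipengine')` would play the role of `pskill ipengine` above.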
> -MinRK
>
>>
>> Thanks in advance for help/advice!
>>
>> Erik
>>