[IPython-User] Using IPython as a Batch Queue

Wes McKinney wesmckinn@gmail....
Sat Jan 21 16:47:37 CST 2012


On Sat, Jan 21, 2012 at 5:27 PM, MinRK <benjaminrk@gmail.com> wrote:
> On Sat, Jan 21, 2012 at 12:33, Erik Petigura <eptune@gmail.com> wrote:
>> Dear IPython,
>>
>> I want to execute many embarrassingly parallel processes.  The way I am
>> doing it is the following:
>>
>> 1. Generate scripts
>>
>>   $> ls -lth *.py
>>   -rwx------  1 petigura  staff   181B Jan 20 15:08 grid0000.py*
>>
>>                     <snip>
>>
>>   -rwx------  1 petigura  staff   184B Jan 20 15:08 grid2730.py*
>>
>> 2. Run them in a load-balanced way, in the following manner.
>>
>>   def srun(s):
>>       """
>>       Run a script with python, redirecting its output to a log
>>       file, and return the exit code.
>>       """
>>       import subprocess  # import inside so it's available on the engines
>>       log = s.split('.')[0] + '.log'
>>       return subprocess.call('python %s > %s' % (s, log), shell=True)
>>
>>   view.map(srun, Scripts, block=True)
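>>
>>   For completeness, the setup this assumes (a minimal sketch; the
>>   glob pattern just matches how my scripts happen to be named):
>>
>>   import glob
>>   from IPython.parallel import Client
>>
>>   rc = Client()                    # connect to the running ipcluster
>>   view = rc.load_balanced_view()   # load-balanced scheduler view
>>   Scripts = sorted(glob.glob('grid*.py'))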
>>
>> I've run into a couple of problems:
>>
>> Periodically, one of my cores drops out.
>
> Can you explain this one? Is there any indication as to why one of
> your engines fails?  It's possible this is a spurious heart failure
> (the monitor wrongly declaring a busy engine dead), which can be
> alleviated by relaxing the heartbeat period to 5-10 seconds with:
>
> c.HeartMonitor.period = 10
>
> in your ipcontroller_config.py
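>
> That file typically lives in your profile directory, e.g.
> ~/.ipython/profile_default/ipcontroller_config.py, and the minimal
> content would be:
>
> c = get_config()
> c.HeartMonitor.period = 10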
>
>
>>  However, when I go back and run it
>> from the shell
>>
>>    $> python script.py
>>
>> it completes.  Is there something that could be hanging the view.map?  One
>> of the reasons I split my jobs up was that if a script fails, subprocess
>> just returns a 1, and presumably view.map would just go on to the next job.
>
> view.map submits all jobs simultaneously, and an error does not
> prevent later tasks in the map from executing.  The error will be
> raised *locally* in the client, but subsequent tasks continue to run.
> If an engine is going down, then all tasks assigned to that engine
> will fail (1/np of the tasks under greedy assignment, where np is the
> number of engines; greedy assignment is the default in 0.12, but no
> longer in master due to some user confusion).
>
> If you want to protect your tasks from engines shutting down, you can
> add some `retries`, which will resubmit a failed task a limited
> number of times before propagating the error up to the client:
>
> view.retries = 2  # resubmit each failed task up to two times
> amr = view.map(srun, scripts)
> # block until all results arrive:
> results = amr.get()
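>
> Since srun returns each script's exit code rather than raising, you
> can then scan the results for failures (a small sketch):
>
> failed = [s for s, code in zip(scripts, results) if code != 0]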
>
>>
>> Also, I have a hard time stopping the cluster.  Doing
>>
>>    $> ipcluster stop
>>
>> Doesn't work.
>
> Can you clarify?  What doesn't work? Is there a traceback? Is there
> any feedback at all, or does it appear to succeed but leave processes
> running? How did you start the engines?
>
>>  What I've been doing is listing all the ipengines and stopping
>> them with the kill command.
>
> I've done this many times as well.  In fact, I even have this little
> mess in my environment:
>
> # `ps | grep` utilities
> psgrep(){
>    # list processes matching a pattern, excluding the grep itself
>    ps aux | grep -e "$@" | grep -v "grep -e $@"
> }
> psgrepkillall(){
>    # echo what matched, then send TERM to each pid (column 2 of ps aux)
>    echo $(psgrep "$@")
>    psgrep "$@" | awk '{ print $2 }' | sed "s@^@kill -TERM @" | sh
> }
> alias psg="psgrep"
> alias pskill="psgrepkillall"
>
> so I can do `pskill ipengine` to terminate all engines.
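>
> (On systems with pkill, `pkill -f ipengine` does much the same thing.)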
>
> -MinRK
>
>>
>> Thanks in advance for help/advice!
>>
>> Erik
>>

Aside / question: do you think IPython is a good fit for a batch queue
system? I guess it depends on how the system is being used (e.g.
single vs. multiple users) and what the robustness requirements are. I
myself built something similar to Celery (http://celeryproject.org/) a
few years ago (before Celery existed), with the requirement that the
central dispatcher could go down without loss of state (it kept batch
status synchronized and stored pickled function arguments and results
in a database like MySQL or MongoDB).
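
A minimal sketch of that persistence idea (using pymongo; the
database, collection, and field names here are made up for
illustration):

    import pickle
    from pymongo import MongoClient  # assumes a MongoDB server is running

    tasks = MongoClient().batchq.tasks  # hypothetical db/collection

    def submit(func_name, *args):
        """Persist a task so a dispatcher restart loses no state."""
        tasks.insert_one({'func': func_name,
                          'args': pickle.dumps(args),
                          'status': 'pending',
                          'result': None})

    def record_result(task_id, result):
        """Store the pickled result once a worker finishes."""
        tasks.update_one({'_id': task_id},
                         {'$set': {'status': 'done',
                                   'result': pickle.dumps(result)}})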

Just a random thought.

- Wes

