[IPython-User] Using IPython as a Batch Queue

Erik Petigura eptune@gmail....
Sat Jan 21 17:39:57 CST 2012


On Jan 21, 2012, at 2:47 PM, Wes McKinney wrote:

> On Sat, Jan 21, 2012 at 5:27 PM, MinRK <benjaminrk@gmail.com> wrote:
>> On Sat, Jan 21, 2012 at 12:33, Erik Petigura <eptune@gmail.com> wrote:
>>> Dear IPython,
>>> 
>>> I want to execute many embarrassingly parallel processes.  The way I am
>>> doing it is the following:
>>> 
>>> 1. Generate scripts
>>> 
>>>   $> ls -lth *.py
>>>   -rwx------  1 petigura  staff   181B Jan 20 15:08 grid0000.py*
>>> 
>>>                     <snip>
>>> 
>>>   -rwx------  1 petigura  staff   184B Jan 20 15:08 grid2730.py*
>>> 
>>> 2. Run them in a load balanced way in the following manner.
>>> 
>>>   def srun(s):
>>>       """
>>>       Convert a script to a python call + log
>>>       """
>>>       log = s.split('.')[0]+'.log'
>>>       return subprocess.call( 'python %s > %s' % (s,log) ,shell=True )
>>> 
>>>   view.map(srun,Scripts,block=True)
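(For reference, a minimal sketch of how the `view` used here is typically obtained with the IPython 0.12 IPython.parallel API; the profile and script list are illustrative, and if srun is defined interactively, subprocess generally has to be importable in the engines' namespace, e.g. by importing it inside the function.)

    import glob
    from IPython.parallel import Client

    rc = Client()                    # connect to a running ipcontroller/ipcluster
    view = rc.load_balanced_view()   # each task goes to whichever engine is free

    Scripts = sorted(glob.glob('grid*.py'))
    # srun as defined above; it returns each script's exit status
    view.map(srun, Scripts, block=True)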
>>> 
>>> I've run into a couple of problems:
>>> 
>>> Periodically, one of my cores drops out.
>> 
>> Can you explain this one? Is there any indication as to why one of
>> your engines fails?  It's possible this is an erroneous heartbeat failure,
>> which can be alleviated by relaxing the heartbeat period to 5-10
>> seconds with:
>> 
>> c.HeartMonitor.period = 10
>> 
>> in your ipcontroller_config.py
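(A small sketch of the config file in question; the location is the typical default for a profile, not a requirement.)

    # <profile dir>/ipcontroller_config.py, e.g. ~/.ipython/profile_default/
    c = get_config()

    # relax the heartbeat so busy engines are not mistaken for dead ones
    # (value as suggested above; see the parallel docs for the exact unit)
    c.HeartMonitor.period = 10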
>> 
>> 
>>>  However, when I go back and run it
>>> from the shell
>>> 
>>>    $> python script.py
>>> 
>>> it completes.  Is there something that could be hanging the view.map?  One
>>> of the reasons why I split my jobs up was so that if a script fails, subprocess
>>> just returns a 1 and presumably view.map would just go on to the next job.
>> 
>> view.map submits all jobs simultaneously, and an error does not
>> prevent later tasks in the map from executing.  The error will be
>> raised *locally* in the client, but subsequent tasks continue to run.
>> If an engine is going down, then all tasks assigned to that engine
>> will fail (1/np tasks during greedy assignment, the default in 0.12
>> but no longer in master due to some user confusion).
>> 
>> If you want to protect your tasks from engines shutting down, you can
>> add some `retries`, which will resubmit a task a limited number of
>> times when it fails before propagating the error up to the client:
>> 
>> view.retries = 2 # retry task after up to two failures
>> amr = view.map(srun, scripts)
>> # wait for results:
>> amr.get()
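(Since srun returns subprocess.call's exit status rather than raising, a quick sketch for spotting which scripts exited nonzero once the map finishes; names follow the snippets above.)

    codes = amr.get()   # blocks until all tasks, including retries, are done
    failed = [s for s, code in zip(scripts, codes) if code != 0]
    print('%d of %d scripts exited nonzero: %s' % (len(failed), len(scripts), failed))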
>> 
>>> 
>>> Also, I have a hard time stopping the cluster.  Doing
>>> 
>>>    $> ipcluster stop
>>> 
>>> Doesn't work.
>> 
>> Can you clarify?  What doesn't work? Is there a traceback? Is there
>> any feedback at all, or does it appear to succeed but leave processes
>> running? How did you start the engines?
>> 
>>>  What I've been doing is listing all the ipengines and stopping
>>> them with the kill command.
>> 
>> I've done this many times as well.  In fact, I even have this little
>> mess in my environment:
>> 
>> # `ps | grep` utilities
>> psgrep(){
>>    ps aux | grep -e "$@" | grep -v "grep -e $@"
>> }
>> psgrepkillall(){
>>    echo $(psgrep $@)
>>    psgrep $@ | awk '{ print $2 }' | sed "s@^@kill -TERM @" | sh
>> }
>> alias psg="psgrep"
>> alias pskill="psgrepkillall"
>> 
>> so I can do `pskill ipengine` to terminate all engines.
>> 
>> -MinRK
>> 
>>> 
>>> Thanks in advance for help/advice!
>>> 
>>> Erik
>>> 
> 
> Aside / question, do you think IPython is a good fit for a batch queue
> system? I guess it depends on how the system is being used (e.g.
> single vs. multiple users) and what the robustness requirements are. I
> myself built something similar to Celery (http://celeryproject.org/) a
> few years ago (before Celery existed) with the requirement that the
> central dispatcher could go down without loss of state
> (synchronization of batch status and storage of pickled function
> arguments and results in a database like MySQL or MongoDB).
> 
> Just a random thought.
> 
> - Wes


Thanks for your suggestion.  I'm using IPython because that's the only parallel tool I know.  The IPython team has developed a great tool for *interactive* parallelism, but it might be overkill for what I need.

My group has a small cluster (four 8-core Mac Pros) that share a file system.  Their duty cycle is pretty low.  For a previous project involving data parallelism, I simply split my computation into 32 jobs and executed them by logging into each machine and spawning 8 jobs.  There's got to be a better way than this!  I think IPython might be a good solution for me because a single controller can be aware of the other machines on the network and take care of load balancing.
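(One option that might fit this setup is ipcluster's SSH engine launcher, since the machines share a file system. A rough sketch of ipcluster_config.py follows; the hostnames are made up, and the exact launcher/traitlet names should be checked against the 0.12 parallel docs.)

    c = get_config()

    # launch engines over ssh rather than on the local machine
    c.IPClusterEngines.engine_launcher_class = 'SSHEngineSetLauncher'

    # eight engines on each of the four Mac Pros
    c.SSHEngineSetLauncher.engines = {
        'macpro1': 8,
        'macpro2': 8,
        'macpro3': 8,
        'macpro4': 8,
    }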

Back when I was at SLAC we used a program called bsub (the LSF batch submission command) to do just this.

Erik
