[IPython-User] Delays in getting state, running as a daemon and ipengine reconnecting.
Wed Nov 14 14:03:40 CST 2012
I have a couple of questions. I've been messing around with both for quite a
while, and I thought it's about time I asked an expert (or a few).
First, the general setup:
I'm running in EC2.
I'm running IPython 0.12.
I start up an ipcontroller, then start and stop instances depending on
how many machines I need.
The ipengines get their configuration from an NFS-mounted home directory.
1) The easy one first. Sometimes I kill the ipcontroller, or network
problems cause the ipengines to disconnect. The problem is that when I
have hundreds of other instances running, it can be a real pain to go around
and restart the ipengines. Can I get the engines to retry connecting to the
controller automatically after a dropped connection?
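I haven't found an engine-side auto-reconnect option in 0.12, so for now I've been considering a plain shell wrapper around the engine. This is just my own sketch, not an IPython feature; `retry`, `MAX_TRIES`, and `DELAY` are names I made up:

```shell
#!/bin/sh
# Hypothetical restart wrapper: rerun a command whenever it exits non-zero,
# up to MAX_TRIES attempts, sleeping DELAY seconds between attempts so a
# dead controller doesn't cause a tight restart loop.
MAX_TRIES=${MAX_TRIES:-5}
DELAY=${DELAY:-10}

retry() {
    tries=1
    until "$@"; do
        [ "$tries" -ge "$MAX_TRIES" ] && return 1
        tries=$((tries + 1))
        sleep "$DELAY"
    done
}

# Usage (profile path from my setup above):
# retry ipengine --profile-dir=/users/bob/.ipython/profile_default
```

The downside is that an engine exiting for a legitimate reason also gets restarted, which is why a bounded `MAX_TRIES` seems safer than looping forever.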
2) Currently I'm running ipengine from a userdata script, which is run as
root at startup. If I want to run two engines as user bob, my script looks like:
sudo -u bob ipengine --profile-dir=/users/bob/.ipython/profile_default &
sudo -u bob ipengine --profile-dir=/users/bob/.ipython/profile_default
This is really error-prone. Can I run ipengine as a daemon under a given
user? Is there a better way?
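One direction I've been looking at is a process supervisor instead of raw sudo commands. A supervisord-style fragment might look like this; the file path is my guess, and I haven't verified this against my setup (it would also cover question 1, since the supervisor restarts dead engines):

```ini
; Hypothetical /etc/supervisor/conf.d/ipengine.conf
[program:ipengine]
command=ipengine --profile-dir=/users/bob/.ipython/profile_default
user=bob
directory=/users/bob
numprocs=2
process_name=%(program_name)s_%(process_num)02d
autorestart=true
```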
3) The third one is the hardest. Many of our jobs are maps, so I do something like:
amr = self.lview.map(myjob, cfgs, ordered=True)
where lview is a load-balanced view. Then I do:
ams = [self.rc.get_result(i, block=False) for i in amr.msg_ids]
I then loop, waiting for the jobs to finish; whenever a contiguous block is
done from zero to "i", I run a function locally:
lastProc = 0
for i, a in enumerate(ams):
    if numpy.all([b.ready() for b in ams[:i + 1]]):
        # process up to element i
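To make the idea concrete outside the cluster, here is a self-contained sketch of the contiguous-prefix check, with fake objects standing in for what rc.get_result returns (FakeResult and contiguous_ready are stand-in names of my own):

```python
# Mock of an AsyncResult: only the .ready() call from the loop above matters.
class FakeResult:
    def __init__(self, done):
        self._done = done

    def ready(self):
        return self._done


def contiguous_ready(ams, last_proc):
    """Return the index one past the last contiguously finished job,
    scanning forward from last_proc so earlier work isn't re-checked."""
    i = last_proc
    while i < len(ams) and ams[i].ready():
        i += 1
    return i


# Jobs 0-2 done, job 3 pending, job 4 done: only the prefix [0, 3) is ready.
ams = [FakeResult(d) for d in (True, True, True, False, True)]
print(contiguous_ready(ams, 0))  # → 3
```

Note this scans from lastProc instead of re-running numpy.all over the whole prefix on every pass, which also avoids the quadratic re-checking in my loop above.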
This all works great under light testing. However, I have found that when I
really load up the cluster with a lot of jobs, something gets stuck early
in the list and I basically don't call contiguousProcess until all the
elements are done. I'm pretty sure that I don't have one job stuck early on,
as I see output from ipcontroller stating that jobs are being submitted to
all engines as it rolls forward through the list.
Am I doing something wrong?