[IPython-User] Delays in getting state, running as a daemon and ipengine reconnecting.

Caius Howcroft caius.howcroft@gmail....
Wed Nov 14 14:03:40 CST 2012


Hi all

I have a few questions that I've been wrestling with for quite a while,
and I thought it was about time I asked an expert (or a few).

First, the general setup:
I'm running in EC2.
I'm running IPython 0.12.
I start an ipcontroller, then start and stop new instances depending on
how many machines I need.
The ipengines get their configuration from an NFS-mounted home directory.


1) The easy one first. Sometimes I kill the ipcontroller, or have network
problems that cause the ipengines to disconnect. If I have hundreds of
other instances running, it is a real pain to go around and restart the
ipengines. Can I get the engines to try reconnecting to the controller
automatically after a dropped connection?

2) Currently I'm running ipengine from a userdata script, which gets run as
root at startup. If I want to run two engines as user bob, my script looks
like this:

#!/bin/bash

sudo -u bob ipengine  --profile-dir=/users/bob/.ipython/profile_default &
sudo -u bob ipengine  --profile-dir=/users/bob/.ipython/profile_default

This is really error-prone. Can I run ipengine as a daemon under a specific
user? Is there a better way?

3) The third one is the hardest. Many of our jobs are maps, so I do
something like this:

amr = self.lview.map(myjob, cfgs, order=True)

lview is a load-balanced view. Then I do:

ams = [self.rc.get_result(i, block=False) for i in amr.msg_ids]

I then loop, waiting for the jobs to finish. Whenever a contiguous block
from zero to "i" is done, I run a function locally.

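import numpy

lastProc = 0
nworking = len(ams)  # start non-zero so the loop is entered
while nworking > 0:
    nworking = 0
    for i, a in enumerate(ams):
        if a.ready():
            # only proceed when every job from 0 to i has finished
            if numpy.all([b.ready() for b in ams[:i+1]]):
                # process up to element i
                contiguousProcess(lastProc, i)
                lastProc = i
        else:
            nworking += 1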

This all works great under light testing conditions. However, I have found
that when I really load up the cluster with a lot of jobs, something gets
stuck early in the list, and I basically don't call contiguousProcess
until all the elements are done. I'm pretty sure that I don't have one job
stuck early on, because I see printout from ipcontroller saying that jobs
are being submitted to all engines as it rolls forward through all the
maps.

Am I doing something wrong?
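
For reference, a minimal sketch of the contiguous-block bookkeeping described above, isolated from the cluster. It uses stand-in objects rather than real AsyncResult instances, so it only illustrates the intended behaviour and says nothing about where the delay comes from. FakeResult, advance_prefix and drain are hypothetical names, not part of IPython. Unlike the loop above, blocks here are half-open ranges [lo, hi) and each block is handed to the callback exactly once.

```python
class FakeResult:
    """Hypothetical stand-in for an AsyncResult; only ready() is modelled."""

    def __init__(self, done=False):
        self.done = done

    def ready(self):
        return self.done


def advance_prefix(results, start):
    """Return the index one past the last contiguous ready result,
    scanning forward from `start`."""
    end = start
    while end < len(results) and results[end].ready():
        end += 1
    return end


def drain(results, process, poll):
    """Call process(lo, hi) once for each newly finished contiguous
    block [lo, hi); call poll() whenever no progress can be made."""
    last = 0
    while last < len(results):
        end = advance_prefix(results, last)
        if end > last:
            process(last, end)
            last = end
        else:
            poll()  # in real use this might be a short time.sleep()
```

With real results, `process` would play the role of contiguousProcess and `poll` would simply wait briefly before the next scan.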

Cheers
Caius