Hi all,<br><br>I have a few questions that I've been messing around with for quite a while, and I thought it's about time I asked an expert (or a few).<br><br>First, the general setup: <br>I'm running in EC2. <br>I'm running IPython 0.12. <br>
I start up an ipcontroller, then start and stop new instances depending on how many machines I need. <br>The ipengines get their configuration from an NFS-mounted home directory. <br><br><br>1) The easy one first. Sometimes I kill the ipcontroller, or have network problems that cause the ipengines to disconnect. The problem is that if I have hundreds of other instances running, it can be a real pain to go around and restart the ipengines. Can I get the engines to try reconnecting to the controller automatically after a dropped connection? <br>
<br>2) Currently I'm running ipengine in a userdata script, which gets run as root on startup. If I want to run 2 engines as user bob, my script looks like this:<br><br>#!/bin/bash<br><br>sudo -u bob ipengine --profile-dir=/users/bob/.ipython/profile_default &<br>
sudo -u bob ipengine --profile-dir=/users/bob/.ipython/profile_default <br><br>This is really error-prone. Can I run ipengine as a daemon under a specific user? Is there a better way?<br><br>3) The third one is the hardest. Many of our jobs are maps, so I do something like this:<br>
<br>amr = self.lview.map(myjob, cfgs, order=True)<br><br>lview is a load-balanced view. Then I do:<br><br>ams = [self.rc.get_result(i, block=False) for i in amr.msg_ids]<br><br>I then loop, waiting for the jobs to finish. When a contiguous block is done from zero to "i", I run a function locally. <br>
<br>lastProc = 0<br>while nworking>0:<br> nworking =0 <br> for i, a in enumerate(ams):<br> if a.ready():<br> if numpy.all([ b.ready() for b in ams[:i+1]]):<br> #process up to element i.<br>
contiguousProcess(lastProc, i)<br> lastProc=i<br> else:<br> nworking+=1<br><br>This all works great under light testing conditions. However, I have found that when I really load up the cluster with a lot of jobs, something gets stuck early on in the list and I basically don't call contiguousProcess until all the elements are done. I'm pretty sure that I don't have one job stuck early on, as I see printout coming from ipcontroller saying that jobs are being submitted to all engines as it rolls forward through all the maps. <br>
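<br>To make the intent of the loop above concrete, here is a minimal sketch of the "process the contiguous finished prefix" idea. It is not the exact code I run. It only assumes that each result object has a ready() method, as the AsyncResult objects above do. The names FakeResult, advance_ready_prefix and process_in_order are hypothetical, and the callback takes a half-open range [start, stop), which is an assumption and differs slightly from contiguousProcess(lastProc, i).<br>

```python
import time


class FakeResult(object):
    """Hypothetical stand-in for an async result; only ready() is assumed."""

    def __init__(self, done=False):
        self.done = done

    def ready(self):
        return self.done


def advance_ready_prefix(results, start):
    """Return the index one past the last result in the ready run starting at `start`."""
    i = start
    while i < len(results) and results[i].ready():
        i += 1
    return i


def process_in_order(results, process, poll_interval=0.5):
    """Call process(start, stop) on each newly finished contiguous block.

    `process` is a hypothetical callback taking a half-open range [start, stop).
    Every index is handed to `process` exactly once, in order.
    """
    done = 0
    while done < len(results):
        new_done = advance_ready_prefix(results, done)
        if new_done > done:
            process(done, new_done)
            done = new_done
        else:
            # Nothing new at the front of the list yet; wait before polling again.
            time.sleep(poll_interval)
```

Each pass only looks at results from the current boundary onward, so a slow job at index k holds back everything after k until it finishes, and earlier results are never rechecked.<br>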
<br>Am I doing something wrong?<br><br>Cheers, <br>Caius <br><br><br>