<br><br><div class="gmail_quote">On Tue, Nov 15, 2011 at 09:57, Ariel Rokem <span dir="ltr"><<a href="mailto:arokem@gmail.com">arokem@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<br><br><div class="gmail_quote"><div><div></div><div class="h5">On Mon, Nov 14, 2011 at 7:45 PM, MinRK <span dir="ltr"><<a href="mailto:benjaminrk@gmail.com" target="_blank">benjaminrk@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204, 204, 204);padding-left:1ex">
<br><br><div class="gmail_quote"><div><div></div><div>On Mon, Nov 14, 2011 at 19:10, Ariel Rokem <span dir="ltr"><<a href="mailto:arokem@gmail.com" target="_blank">arokem@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204, 204, 204);padding-left:1ex">
Hi everyone, <br><br>Following up on this thread, I am trying to get this working on the SGE on our local cluster (thankfully, everyone is away at a conference, so I have the cluster pretty much to myself. Good week for experimenting...). <br>
<br>I updated my fork from ipython/master this afternoon and followed the instructions below. I am getting the following behavior: <br><br>celadon:~ $ipcluster start --n=10 --profile=sge<br>[IPClusterStart] Using existing profile dir: u'/home/arokem/.config/ipython/profile_sge'<br>
[IPClusterStart] Starting ipcluster with [daemon=False]<br>[IPClusterStart] Creating pid file: /home/arokem/.config/ipython/profile_sge/pid/ipcluster.pid<br>[IPClusterStart] Starting PBSControllerLauncher: ['qsub', u'./sge_controller']<br>
[IPClusterStart] adding job array settings to batch script<br>ERROR:root:Error in periodic callback<br>Traceback (most recent call last):<br> File "/usr/lib64/python2.7/site-packages/zmq/eventloop/ioloop.py", line 423, in _run<br>
self.callback()<br> File "/home/arokem/usr/local/lib/python2.7/site-packages/IPython/parallel/apps/ipclusterapp.py", line 497, in start_controller<br> self.controller_launcher.start()<br> File "/home/arokem/usr/local/lib/python2.7/site-packages/IPython/parallel/apps/launcher.py", line 1022, in start<br>
return super(SGEControllerLauncher, self).start(1)<br> File "/home/arokem/usr/local/lib/python2.7/site-packages/IPython/parallel/apps/launcher.py", line 936, in start<br> self.write_batch_script(n)<br> File "/home/arokem/usr/local/lib/python2.7/site-packages/IPython/parallel/apps/launcher.py", line 925, in write_batch_script<br>
script_as_string = self.formatter.format(self.batch_template, **self.context)<br> File "/usr/lib64/python2.7/string.py", line 545, in format<br> return self.vformat(format_string, args, kwargs)<br> File "/usr/lib64/python2.7/string.py", line 549, in vformat<br>
result = self._vformat(format_string, args, kwargs, used_args, 2)<br> File "/home/arokem/usr/local/lib/python2.7/site-packages/IPython/utils/text.py", line 652, in _vformat<br> obj = eval(field_name, kwargs)<br>
File "<string>", line 1, in <module><br>NameError: name 'n' is not defined<br>[IPClusterStart] Starting 10 engines<br>[IPClusterStart] Starting 10 engines with SGEEngineSetLauncher: ['qsub', u'./sge_engines']<br>
[IPClusterStart] adding job array settings to batch script<br>[IPClusterStart] Writing instantiated batch script: ./sge_engines<br>[IPClusterStart] Job submitted with job id: '430658'<br>[IPClusterStart] Process 'qsub' started: '430658'<br>
[IPClusterStart] Engines appear to have started successfully<br><br>It looks like something goes wrong (the NameError), but the jobs still get submitted. For a brief time, qmon shows a list of jobs with that id, but it disappears from qmon almost immediately (somehow gets deleted?). When I then try to initialize a parallel.Client with the "sge" profile in an ipython session, I get a "TimeoutError: Hub connection request timed out". I also tried starting ipcluster with the default profile and running some computations, and I get the expected speed-up of approximately 7-fold (on an 8-core machine), so some things do work. Does anyone have any idea what is going wrong with SGE? <br>
</blockquote><div><br></div></div></div><div>This is a horrible typo that crept in when I did some reorganization in the launchers. Should be fixed in master.</div></div></blockquote></div></div><div><br>Yes - fixed. I don't see that NameError anymore. Thanks! <br>
</div><div class="im"><div><br> </div><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204, 204, 204);padding-left:1ex"><div class="gmail_quote"><div></div><div>The TimeoutError in the client generally means that the controller isn't running, or at least isn't where connection files claimed it to be.</div>
<div><div><br></div></div></div></blockquote></div><div><br>OK - I think the controller really was not there before. Now it is being started, but I am still having trouble getting my engines to persist on SGE. I see them get created through qmon, as well as the ipcontroller, but then the engine jobs are almost immediately deleted from the "running jobs". The controller job persists, and when I initialize a client I don't get a TimeoutError, but rather a client object with an empty ids list. Is that still a problem with the connection files? Are those the ones under ~/.config/ipython/profile_sge/security? <br>
</div></div></blockquote><div><br></div><div>It could be, or it could be an issue of the engines giving up too soon, if the controller isn't ready for them. What is the output of the engine jobs?</div><div><br></div>
<div>The most likely cases:</div><div><br></div><div>1. the engines start before the controller has written the connection files. They will wait up to `IPEngineApp.wait_for_url_file` for that file to exist (default 5s), then give up.</div>
<div>2. same as 1., but old files exist, so the connection info will be stale. This is normally addressed by setting `IPControllerApp.reuse_files=True`, but I'm not sure that works when the controller is started by SGE, where it won't consistently be on the same host. You may want to manually empty the security dir (IPYTHON_DIR/profile_sge/security) prior to starting the cluster, to prevent this case.</div>
<div>3. connection info is wrong - the Controller is not listening on the right interface, or is listening on localhost only (the default). This is controlled by `HubFactory.ip`.</div><div>4. regular timeout (controlled by `Engine.timeout`) - the connection info is correct, but the controller does not respond promptly. This value may need to be large in cases where `reuse_files=True` and the controller/engines start simultaneously or out-of-order.</div>
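<div><br></div><div>For case 2, emptying the security dir before starting the cluster can be scripted. The following is a minimal sketch, not part of IPython: `clear_stale_connection_files` is a hypothetical helper name, and it assumes the connection files are the `*.json` files in that directory.</div>

```python
import glob
import os


def clear_stale_connection_files(security_dir):
    """Delete *.json connection files in security_dir and return the removed paths.

    Hypothetical helper (not an IPython API). Assumes connection files are
    the *.json files in the profile's security directory.
    """
    removed = []
    for path in sorted(glob.glob(os.path.join(security_dir, "*.json"))):
        os.remove(path)
        removed.append(path)
    return removed


# Example call, using the profile location shown in the log earlier in this thread:
# clear_stale_connection_files(
#     os.path.expanduser("~/.config/ipython/profile_sge/security"))
```

<div>Run it only while no controller is up, since it also deletes the files of a live controller.</div>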
<div><br></div><div>The easiest solution to all this is usually to increase IPClusterStart.delay, which is a delay (in seconds) between starting the Controller and starting the engines when you do `ipcluster start`. This is less effective in SGE, where the time between calling `ipcluster start` and the jobs actually starting on nodes can be hours - so a few seconds of delay in submitting the batch jobs has no effect. It may be sufficient if your queue is clear, and jobs start right away.</div>
<div><br></div><div>Depending on your sysadmin, it may make sense to *not* start the Controller with SGE, and only entrust SGE with the engines. This gives you more control over the order of events. There is no need *in general* for your Controller and Engines to use the same launchers. There is a reason they are separate config variables.</div>
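<div><br></div><div>If the controller is started by hand rather than through SGE, one way to control the order of events is to wait until its connection file exists before submitting the engines. A minimal sketch under that assumption: `wait_for_file` is a hypothetical helper, not an IPython function, and the path in the usage comment is only illustrative.</div>

```python
import os
import time


def wait_for_file(path, timeout=60.0, poll=0.5):
    """Return True once path exists, or False if timeout seconds pass first.

    Hypothetical helper (not an IPython API); timeout and poll are in seconds.
    """
    deadline = time.time() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.time() >= deadline:
            return False
        time.sleep(poll)


# Illustrative usage: start the controller yourself, wait for the connection
# file it writes into the profile's security dir, then submit the engines.
# if wait_for_file("/path/to/security/connection_file.json", timeout=120):
#     ...  # now submit the engine jobs through SGE
```

<div>This only checks that the file exists. It does not detect a stale file left over from an earlier run, so it works best after the security dir has been emptied.</div>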
<div><br></div><div>-MinRK</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class="gmail_quote"><div>
<br> </div><div><div></div><div class="h5"><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204, 204, 204);padding-left:1ex"><div class="gmail_quote"><div><div><br><br> </div><div>
</div></div></div></blockquote><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204, 204, 204);padding-left:1ex"><div class="gmail_quote"><div><div><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204, 204, 204);padding-left:1ex">
<br>Thanks, <br><font color="#888888"><br>Ariel <br></font><div><div></div><div><br><br><br><br><div class="gmail_quote">On Wed, Aug 24, 2011 at 3:07 PM, MinRK <span dir="ltr"><<a href="mailto:benjaminrk@gmail.com" target="_blank">benjaminrk@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204, 204, 204);padding-left:1ex">
<div>On Wed, Aug 24, 2011 at 15:05, Dharhas Pothina<br>
<<a href="mailto:Dharhas.Pothina@twdb.state.tx.us" target="_blank">Dharhas.Pothina@twdb.state.tx.us</a>> wrote:<br>
><br>
> I was able to start the engines and they were submitted to the queue<br>
> properly but I do not have a json file in the corresponding security folder.<br>
> Do I need to do something to generate it?<br>
<br>
</div>The JSON file is written by ipcontroller, so it will only show up<br>
after the controller has started.<br>
<div><div></div><div><br>
><br>
> - dharhas<br>
><br>
>>>> MinRK <<a href="mailto:benjaminrk@gmail.com" target="_blank">benjaminrk@gmail.com</a>> 8/24/2011 4:44 PM >>><br>
> On a login node on the cluster:<br>
><br>
> # create profile with default parallel config files, called sge<br>
> [login] $> ipython profile create sge --parallel<br>
><br>
> Edit IPYTHON_DIR/profile_sge/ipcontroller_config.py, adding the line:<br>
><br>
> c.HubFactory.ip = '0.0.0.0'<br>
><br>
> to instruct the controller to listen on all interfaces.<br>
><br>
> Edit IPYTHON_DIR/profile_sge/ipcluster_config.py, adding the lines:<br>
><br>
> c.IPClusterEngines.engine_launcher_class = 'SGEEngineSetLauncher'<br>
> c.IPClusterStart.controller_launcher_class = 'SGEControllerLauncher'<br>
><br>
> # optional: specify a queue for all:<br>
> c.SGELauncher.queue = 'short'<br>
><br>
> to instruct ipcluster to use SGE to launch the engines and the controller.<br>
><br>
> At this point, you can start 10 engines and a controller with:<br>
><br>
> [login] $> ipcluster start -n 10 --profile=sge<br>
><br>
> Now the only file you will need to connect to the cluster will be in:<br>
><br>
> IPYTHON_DIR/profile_sge/security/ipcontroller-client.json<br>
><br>
> Just move that file around, and you will be able to connect clients.<br>
> To connect from a laptop, you will probably need to specify a login<br>
> node as the ssh server when you do:<br>
><br>
> from IPython import parallel<br>
><br>
> rc = parallel.Client('/path/to/ipcontroller-client.json',<br>
> sshserver='you@login.mycluster.etc')<br>
><br>
> -MinRK<br>
><br>
><br>
> On Wed, Aug 24, 2011 at 13:18, Dharhas Pothina<br>
> <<a href="mailto:Dharhas.Pothina@twdb.state.tx.us" target="_blank">Dharhas.Pothina@twdb.state.tx.us</a>> wrote:<br>
>> Hi All,<br>
>><br>
>> We have managed to parallelize one of our spatial interpolation scripts<br>
>> very<br>
>> easily with the new ipython parallel. Thanks for developing such a great<br>
>> tool, it was fairly easy to get working. Now we are trying to set things<br>
>> up<br>
>> to run on our internal cluster and I'm having difficulties understanding<br>
>> how<br>
>> to configure things.<br>
>><br>
>> What I would like to do is have ipython running on a local machine<br>
>> (windows<br>
>> & linux) connect to the cluster, request some nodes through SGE and run<br>
>> the<br>
>> computation. I'm not quite getting what goes where from the documentation.<br>
>><br>
>> I think I understood the PBS example but I'm still not understanding where<br>
>> I<br>
>> would put the connection information to log into the cluster. I would<br>
>> really<br>
>> appreciate a step by step of what files need to be where and any example<br>
>> config files for an SGE setup.<br>
>><br>
>> thanks,<br>
>><br>
>> - dharhas<br>
>><br>
>><br>
>><br>
>><br>
>><br>
>> _______________________________________________<br>
>> IPython-User mailing list<br>
>> <a href="mailto:IPython-User@scipy.org" target="_blank">IPython-User@scipy.org</a><br>
>> <a href="http://mail.scipy.org/mailman/listinfo/ipython-user" target="_blank">http://mail.scipy.org/mailman/listinfo/ipython-user</a><br>
>><br>
>><br>
><br>
><br>
</div></div></blockquote></div><br>
</div></div><br>
<br></blockquote></div></div></div><br>
</blockquote></div></div></div><br>
</blockquote></div><br>