<br><br><div class="gmail_quote">On Mon, Jun 18, 2012 at 6:23 AM, Jon Olav Vik <span dir="ltr"><<a href="mailto:jonovik@gmail.com" target="_blank">jonovik@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
I am using IPython.parallel on a shared Linux cluster, but seem to be running<br>
into limitations on how many engines I can connect. My workflow is roughly as<br>
follows:<br>
<br>
== On the login node (using GNU screen) ==<br>
nice ipcluster start --n=1<br>
for i in {1..100}; do sbatch ~/worker.sh; done<br>
nice python mainscript.py<br>
<br>
== worker.sh ==<br>
#!/bin/bash<br>
#SBATCH --time=1:0:0<br>
#SBATCH ...more options...<br>
ipcluster engines<br>
<br>
== mainscript.py ==<br>
from IPython.parallel import Client<br>
c = Client()<br>
lv = c.load_balanced_view()<br>
<br>
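@lv.parallel(ordered=False, retries=10)<br>
def do(workpiece):<br>
&nbsp;&nbsp;&nbsp;&nbsp;# Import on the engine, where the function body actually runs<br>
&nbsp;&nbsp;&nbsp;&nbsp;from my_dependencies import do_some_work<br>
&nbsp;&nbsp;&nbsp;&nbsp;import os<br>
&nbsp;&nbsp;&nbsp;&nbsp;import socket<br>
&nbsp;&nbsp;&nbsp;&nbsp;do_some_work(workpiece)<br>
&nbsp;&nbsp;&nbsp;&nbsp;# Feedback on progress<br>
&nbsp;&nbsp;&nbsp;&nbsp;return workpiece, os.getpid(), socket.gethostname()<br>
<br>
workpieces = range(1000)<br>
for workpiece, pid, hostname in do.map(workpieces):<br>
&nbsp;&nbsp;&nbsp;&nbsp;print workpiece, pid, hostname<br>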
<br>
<br>
This will flood the job queue with short (hour-long) worker jobs that start<br>
ipengines. do.map() will add new ipengines to the workforce as they become<br>
available, and with the retries= option, it will tolerate some tasks being<br>
interrupted when worker jobs time out in the queue system. The ipcontroller<br>
needs to stay alive until all tasks are done, so I run it discreetly on the<br>
login node where there is no timeout. (Only the workers are memory or CPU<br>
intensive, and I only pass index numbers as workpieces.) In addition, I must<br>
start one engine so that load_balanced_view() isn't "unable to build targets<br>
without any engines". I use GNU screen so I can disconnect while leaving my<br>
jobs running. Once things are done, I cancel the remaining worker jobs.<br>
<br>
Everything works reliably on two of the clusters I'm using. On the third,<br>
however, the controller/hub is unable to handle more than about 250 engines.<br>
The problem is present even if I run "ipcluster start" on a compute node.<br>
<br>
The last lines of "nice ipcluster start --n=1 --debug" are as follows:<br>
<br>
2012-06-18 14:03:58.159 [IPClusterStart] 2012-06-18 14:03:58.158<br>
[IPControllerApp] client::client 'c5dd44d5-b59d-4392-bac0-917e9ef4c9d8'<br>
requested u'registration_request'<br>
2012-06-18 14:03:58.161 [IPClusterStart] Too many open files<br>
(tcp_listener.cpp:213)<br>
2012-06-18 14:03:58.274 [IPClusterStart] Process '.../python' stopped: {'pid':<br>
21820, 'exit_code': -6}<br>
2012-06-18 14:03:58.275 [IPClusterStart] IPython cluster: stopping<br>
2012-06-18 14:03:58.275 [IPClusterStart] Stopping Engines...<br>
2012-06-18 14:04:01.281 [IPClusterStart] Removing pid file: .../.ipython/<br>
profile_default/pid/ipcluster.pid<br>
<br>
The culprit seems to be "Too many open files (tcp_listener.cpp:213)". I would<br>
like to know where this limit is set, and how to modify it. Also, I wonder if<br>
it would help to spread connection attempts out in time. That might help if the<br>
problem is too many simultaneous requests, but not if the limit applies to how<br>
many engines I can connect simultaneously. Any other advice would be welcome<br>
too.<br></blockquote><div><br></div><div>This is the limit on open file descriptors (fds) set by your system. See your system's documentation on 'ulimit' for how to change it.</div><div><br></div><div>You can try spreading out the connection attempts, but I don't think it will change anything.</div>
<div>I do not believe any transient sockets are opened during the connection process, so the limit applies to how many engines are connected at once.</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
<br>
(This setup works like a charm when it works. I don't have to guess in advance<br>
how long the work will take to finish, and the short worker jobs can utilize<br>
gaps in the queue. I can post more about it later if there is interest.)<br>
<br>
Best regards,<br>
Jon Olav<br>
<br>
<br>
_______________________________________________<br>
IPython-User mailing list<br>
<a href="mailto:IPython-User@scipy.org">IPython-User@scipy.org</a><br>
<a href="http://mail.scipy.org/mailman/listinfo/ipython-user" target="_blank">http://mail.scipy.org/mailman/listinfo/ipython-user</a><br>
</blockquote></div><br>
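<div>For the fd limit mentioned above: in bash, 'ulimit -n' prints the current soft limit and 'ulimit -Hn' the hard limit. The limit is per process and inherited by child processes, so it has to be raised in the shell or process that launches the controller. As a rough sketch only, the same check and raise can be done from Python with the standard resource module. The function name raise_open_file_limit and the value 4096 are placeholders of mine, not anything from IPython, and where to call it in your setup is an assumption you would need to verify.</div>

```python
import resource


def raise_open_file_limit(wanted):
    """Raise this process's soft limit on open files toward `wanted`.

    Never lowers the limit and never exceeds the hard limit.
    Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft, hard  # already unlimited, nothing to raise
    if hard != resource.RLIM_INFINITY:
        # An unprivileged process cannot go above the hard limit
        wanted = min(wanted, hard)
    new_soft = max(soft, wanted)  # never lower the current limit
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

<div>Calling raise_open_file_limit(4096) returns the soft and hard limits now in effect. If the hard limit itself is too low, only an administrator can raise it.</div>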