[IPython-User] ipcluster: Too many open files (tcp_listener.cpp:213)

Jon Olav Vik jonovik@gmail....
Mon Jun 18 08:23:26 CDT 2012

I am using IPython.parallel on a shared Linux cluster, but seem to be running 
into limitations on how many engines I can connect. My workflow is roughly as 
follows:

== On the login node (using GNU screen) ==
nice ipcluster start --n=1
for i in {1..100}; do sbatch ~/worker.sh; done
nice python mainscript.py

== worker.sh ==
#SBATCH --time=1:0:0
#SBATCH ...more options...
ipcluster engines

== mainscript.py ==
from IPython.parallel import Client
c = Client()
lv = c.load_balanced_view()

@lv.parallel(ordered=False, retries=10)
def do(workpiece):
    from my_dependencies import do_some_work
    import os
    import socket
    do_some_work(workpiece)
    # Feedback on progress: which workpiece ran in which process on which host
    return workpiece, os.getpid(), socket.gethostname()

workpieces = range(1000)
for workpiece, pid, hostname in do.map(workpieces):
    print workpiece, pid, hostname

This will flood the job queue with short (hour-long) worker jobs that start 
ipengines. do.map() will add new ipengines to the workforce as they become 
available, and with the retries= option, it will tolerate some tasks being 
interrupted when worker jobs time out in the queue system. The ipcontroller 
needs to stay alive until all tasks are done, so I run it discreetly on the 
login node where there is no timeout. (Only the workers are memory or CPU 
intensive, and I only pass index numbers as workpieces.) In addition, I must 
start one engine so that load_balanced_view() isn't "unable to build targets 
without any engines". I use GNU screen so I can disconnect while leaving my 
jobs running. Once things are done, I cancel the remaining worker jobs.
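A possible alternative to starting a placeholder engine is to poll the 
client until at least one engine has registered, and only then build the 
load-balanced view. The sketch below is untested against a real cluster. The 
helper name wait_for_engines and its parameters are hypothetical; the only 
thing it relies on from IPython.parallel is the Client.ids list of registered 
engine ids.

```python
import time


def wait_for_engines(client, minimum=1, timeout=600.0, poll=5.0,
                     sleep=time.sleep):
    """Block until client.ids lists at least `minimum` engines.

    Hypothetical helper: `client` is assumed to expose an `ids` attribute
    (as IPython.parallel's Client does). Returns the number of engines
    seen, or raises RuntimeError once `timeout` seconds have been waited.
    """
    waited = 0.0
    while True:
        count = len(client.ids)
        if count >= minimum:
            return count
        if waited >= timeout:
            raise RuntimeError(
                "only %d engine(s) registered after %.0f s" % (count, waited))
        sleep(poll)
        waited += poll
```

In mainscript.py this would be called as wait_for_engines(c) between 
c = Client() and lv = c.load_balanced_view().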

Everything works reliably on two of the clusters I'm using. On the third, 
however, the controller/hub is unable to handle more than about 250 engines. 
The problem is present even if I run "ipcluster start" on a compute node.

The last lines of "nice ipcluster start --n=1 --debug" are as follows:

2012-06-18 14:03:58.159 [IPClusterStart] 2012-06-18 14:03:58.158 
[IPControllerApp] client::client 'c5dd44d5-b59d-4392-bac0-917e9ef4c9d8' 
requested u'registration_request'
2012-06-18 14:03:58.161 [IPClusterStart] Too many open files 
2012-06-18 14:03:58.274 [IPClusterStart] Process '.../python' stopped: {'pid': 
21820, 'exit_code': -6}
2012-06-18 14:03:58.275 [IPClusterStart] IPython cluster: stopping
2012-06-18 14:03:58.275 [IPClusterStart] Stopping Engines...
2012-06-18 14:04:01.281 [IPClusterStart] Removing pid file: .../.ipython/

The culprit seems to be "Too many open files (tcp_listener.cpp:213)". I would 
like to know where this limit is set, and how to modify it. Also, I wonder if 
it would help to spread connection attempts out in time. That might help if the 
problem is too many simultaneous requests, but not if the limit applies to how 
many engines I can connect simultaneously. Any other advice would be welcome.
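
One way to check the first possibility is sketched below. It assumes that the 
message reflects the per-process limit on open file descriptors 
(RLIMIT_NOFILE), which is a guess based on the wording of the error and not 
something the log confirms. The sketch uses Python's standard resource module 
to read the soft and hard limits and to raise the soft limit as far as the 
hard limit allows. The function names are hypothetical. Limits are inherited 
from the parent process, so this would have to run inside the controller 
process, or the equivalent "ulimit -n" would have to be issued in the shell 
that launches ipcluster.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) limits on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_nofile_soft_limit(target):
    """Raise the soft limit towards `target`, never beyond the hard limit.

    Never lowers the limit. Returns the soft limit in effect afterwards.
    Raising the hard limit itself normally needs administrator rights.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft != resource.RLIM_INFINITY and target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```

Comparing nofile_limits() on the cluster that fails with the two that work 
would show whether the limit differs between them. If the hard limit is the 
one that is too low, only the system administrators can change it.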

(This setup works like a charm when it works. I don't have to guess in advance 
how long the work will take to finish, and the short worker jobs can utilize 
gaps in the queue. I can post more about it later if there is interest.)

Best regards,
Jon Olav

