[IPython-User] ipcluster: Too many open files (tcp_listener.cpp:213)

MinRK benjaminrk@gmail....
Mon Jun 18 15:09:39 CDT 2012


On Mon, Jun 18, 2012 at 6:23 AM, Jon Olav Vik <jonovik@gmail.com> wrote:

> I am using IPython.parallel on a shared Linux cluster, but seem to be
> running into limitations on how many engines I can connect. My workflow is
> roughly as follows:
>
> == On the login node (using GNU screen) ==
> nice ipcluster start --n=1
> for i in {1..100}; do sbatch ~/worker.sh; done
> nice python mainscript.py
>
> == worker.sh ==
> #!/bin/bash
> #SBATCH --time=1:0:0
> #SBATCH ...more options...
> ipcluster engines
>
> == mainscript.py ==
> from IPython.parallel import Client
> c = Client()
> lv = c.load_balanced_view()
>
> @lv.parallel(ordered=False, retries=10)
> def do(workpiece):
>    from my_dependencies import do_some_work
>    import os
>    import socket
>    do_some_work(workpiece)
>    # Feedback on progress
>    return workpiece, os.getpid(), socket.gethostname()
>
> workpieces = range(1000)
> for workpiece, pid, hostname in do.map(workpieces):
>    print workpiece, pid, hostname
>
>
> This will flood the job queue with short (hour-long) worker jobs that start
> ipengines. do.map() will add new ipengines to the workforce as they become
> available, and with the retries= option, it will tolerate some tasks being
> interrupted when worker jobs time out in the queue system. The ipcontroller
> needs to stay alive until all tasks are done, so I run it discreetly on the
> login node where there is no timeout. (Only the workers are memory or CPU
> intensive, and I only pass index numbers as workpieces.) In addition, I must
> start one engine so that load_balanced_view() isn't "unable to build targets
> without any engines". I use GNU screen so I can disconnect while leaving my
> jobs running. Once things are done, I cancel the remaining worker jobs.
>
> Everything works reliably on two of the clusters I'm using. On the third,
> however, the controller/hub is unable to handle more than about 250 engines.
> The problem is present even if I run "ipcluster start" on a compute node.
>
> The last lines of "nice ipcluster start --n=1 --debug" are as follows:
>
> 2012-06-18 14:03:58.159 [IPClusterStart] 2012-06-18 14:03:58.158 [IPControllerApp] client::client 'c5dd44d5-b59d-4392-bac0-917e9ef4c9d8' requested u'registration_request'
> 2012-06-18 14:03:58.161 [IPClusterStart] Too many open files (tcp_listener.cpp:213)
> 2012-06-18 14:03:58.274 [IPClusterStart] Process '.../python' stopped: {'pid': 21820, 'exit_code': -6}
> 2012-06-18 14:03:58.275 [IPClusterStart] IPython cluster: stopping
> 2012-06-18 14:03:58.275 [IPClusterStart] Stopping Engines...
> 2012-06-18 14:04:01.281 [IPClusterStart] Removing pid file: .../.ipython/profile_default/pid/ipcluster.pid
>
> The culprit seems to be "Too many open files (tcp_listener.cpp:213)". I would
> like to know where this limit is set, and how to modify it. Also, I wonder if
> it would help to spread connection attempts out in time. That might help if
> the problem is too many simultaneous requests, but not if the limit applies
> to how many engines I can connect simultaneously. Any other advice would be
> welcome too.
>

This is just the file descriptor (fd) limit set by your system. See the docs on
changing 'ulimit' for your platform.
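For instance, on a typical Linux node you can check and raise the per-process
limit from the shell that launches ipcluster, roughly like this (4096 is just an
example value; how high you need to go depends on how many engines you run, and
the hard limit may be capped by your admins):

# show the current soft and hard limits on open file descriptors
ulimit -Sn
ulimit -Hn

# raise the soft limit for this shell and its children (e.g. ipcluster),
# up to the hard limit; raising the hard limit itself usually needs root,
# e.g. an entry in /etc/security/limits.conf
ulimit -n 4096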

You can try to spread out connection attempts, but I don't think it will change
anything: I do not believe there are transient sockets during the connection
process, so what matters is how many engines are connected at once, not how
quickly they connect.



>
>
> (This setup works like a charm when it works. I don't have to guess in
> advance how long the work will take to finish, and the short worker jobs
> can utilize gaps in the queue. I can post more about it later if there is
> interest.)
>
> Best regards,
> Jon Olav