[IPython-User] ipcluster: Too many open files (tcp_listener.cpp:213)

Jon Olav Vik jonovik@gmail....
Tue Jun 19 03:06:37 CDT 2012


Fernando Perez <fperez.net <at> gmail.com> writes:

> On Tue, Jun 19, 2012 at 12:39 AM, MinRK <benjaminrk <at> gmail.com> wrote:
> >
> > This happens at the zeromq level - IPython has no way of controlling this.
> 
> Question, what's the number of open fds per engine right now (plus any
> others opened by the hub)? 

My guess is four file descriptors per engine, plus a few for the hub etc. 
Repeated tests yesterday gave me 240 engines from 10 nodes with 24 processors 
each, whereas the 11th node seemed to break the controller's back. That is 
consistent with `ulimit -n` / 4 = 1024 / 4 = 256 engines.
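
If that formula holds, a submission script could read the controller's 
open-file limit and derive a safe engine count up front. A minimal sketch, 
where the 4-descriptors-per-engine figure and the hub overhead are 
assumptions based on the guess above:

import resource

# Soft limit on open file descriptors for the controller process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

FDS_PER_ENGINE = 4    # assumed: sockets per engine at the controller
HUB_OVERHEAD = 32     # assumed: headroom for the hub, logs, etc.

max_engines = (soft - HUB_OVERHEAD) // FDS_PER_ENGINE
print("Request at most %d engines" % max_engines)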

> At least if we advertise what the formula
> is, users could tweak their job submission scripts to test their
> environment and only request a total number of engines that's safe
> given the constraints they find...

Googling around, I see that some batch schedulers can constrain the number of 
concurrent CPUs per user, which would be a perfect fit here.

(My problem is that my scheduled jobs quickly burn out when engines cannot 
connect, and I need to wait until some connections have died before scheduling 
more. It occurs to me now that I could perhaps have my main loop check 
len(c.ids) and release batch queue holds once there are vacancies.)
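
For instance, a sketch of such a loop, assuming a PBS/Torque-style scheduler 
where `qrls` releases a user hold (the job-id bookkeeping and the 
24-engines-per-job figure are my assumptions):

import time
import subprocess
from IPython.parallel import Client

c = Client()                # connects via the default ipcontroller-client.json
MAX_ENGINES = 256           # `ulimit -n` / 4, per the formula above
ENGINES_PER_JOB = 24        # assumed: one 24-core node per batch job
held_jobs = [...]           # hypothetical: ids of jobs submitted with a hold

while held_jobs:
    vacancies = MAX_ENGINES - len(c.ids)
    if vacancies >= ENGINES_PER_JOB:
        subprocess.call(["qrls", held_jobs.pop(0)])  # release one held job
    time.sleep(30)          # poll at a gentle interval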


A good workaround might be to have multiple ipclusters running:

from IPython.parallel import Client
import numpy as np

# One connection file (ipcontroller-client.json) per running ipcluster.
jsonfiles = [...]
clients = [Client(f) for f in jsonfiles]
views = [client.load_balanced_view() for client in clients]

def do(workpiece):
    pass  # actual work on one piece goes here

# Wrap `do` as a load-balanced parallel function on each cluster.
pdo = [view.parallel(ordered=False, retries=10)(do) for view in views]

# Insert clever coordination of multiple clusters here...
# Simple example: split the work evenly across the clusters.
workpieces = ...
chunks = np.array_split(workpieces, len(clients))
async_results = [f.map(chunk) for f, chunk in zip(pdo, chunks)]
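
The per-cluster results could then be gathered as they complete; with 
ordered=False, iterating an AsyncMapResult yields results in completion 
order. A minimal sketch following on from the above:

results = []
for ar in async_results:
    results.extend(ar)    # blocks as needed until each piece arrives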

Engine nodes could be spread across the clusters according to their CPU count 
or some similar criterion.


