[IPython-User] ipcluster: Too many open files (tcp_listener.cpp:213)

MinRK benjaminrk@gmail....
Tue Jun 19 03:15:30 CDT 2012


On Tue, Jun 19, 2012 at 1:06 AM, Jon Olav Vik <jonovik@gmail.com> wrote:

> Fernando Perez <fperez.net <at> gmail.com> writes:
>
> > On Tue, Jun 19, 2012 at 12:39 AM, MinRK <benjaminrk <at> gmail.com>
> wrote:
> > >
> > > This happens at the zeromq level - IPython has no way of controlling
> this.
> >
> > Question, what's the number of open fds per engine right now (plus any
> > others opened by the hub)?
>
> My guess is four per engine + some for the hub etc. Repeated tests yesterday
> gave me 240 engines from 10 nodes with 24 processors each, whereas the 11th
> node seemed to break the controller's back, consistent with `ulimit -n`
> (1024) / 4 = 256.
>

It is 3, but there are already quite a few before any engines
connect (~40).
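
So as a rough back-of-the-envelope check (a minimal sketch using the numbers
above, not an exact accounting):

import resource

# soft limit on open file descriptors for this process (same as `ulimit -n`)
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

baseline = 40       # fds already in use at the controller before engines connect
per_engine = 3      # fds added per connected engine
max_engines = (soft - baseline) // per_engine
print("the controller can probably handle ~%d engines" % max_engines)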


>
> > At least if we advertise what the formula
> > is, users could tweak their job submission scripts to test their
> > environment and only request a total number of engines that's safe
> > given the constraints they find...
>
> Googling around, I see that some batch schedulers have constraints on the
> number of concurrent CPUs per user, which would be a perfect fit here.
>

That helps as long as you restrict yourself to one engine per CPU, which is
entirely up to the user.


>
> (My problem is that my scheduled jobs quickly burn out when engines cannot
> connect, and I need to wait until some connections have died, then schedule
> more. It occurs to me now that I could perhaps have my main loop check
> len(c.ids) and release batch queue holds once there are vacancies.)
>
>
> A good workaround might be to have multiple ipclusters running:
>
> import numpy as np
> from IPython.parallel import Client
>
> # one connection file per running ipcluster
> jsonfiles = [...]
> c = [Client(i) for i in jsonfiles]
> lv = [i.load_balanced_view() for i in c]
>
> def do(workpiece):
>     pass  # actual work goes here
>
> # one load-balanced parallel function per cluster
> pdo = [i.parallel(ordered=False, retries=10)(do) for i in lv]
>
> # Insert clever coordination of multiple clusters here...
> # Simple example: split the work evenly across the clusters
> workpieces = ...
> workpieces = np.array_split(workpieces, len(c))
> async_results = [i.map(j) for i, j in zip(pdo, workpieces)]
>
> Engine nodes could be spread across clusters according to their CPU count or
> something.
>

Yes, this is the sensible approach.  I have in mind a scaling model for the
cluster where schedulers are sharded, and engines/clients are split across
different collections of identical schedulers.  With the way zeromq works,
this is actually super easy, with one exception: the load-balanced scheduler,
which has state and would presumably need to share that state with its clones
somehow.  All of the other schedulers are totally stateless, and can be
replicated without any issue.
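
On the vacancy-check idea quoted above (releasing batch queue holds once
len(c.ids) shows room), a minimal sketch, assuming a hypothetical
release_hold() helper that releases one held job in your batch system
(e.g. via qrls):

import time
from IPython.parallel import Client

def release_hold():
    # hypothetical helper: release one held batch job,
    # e.g. by shelling out to `qrls <jobid>`
    pass

c = Client()
target = 240    # total number of engines you ultimately want connected

while len(c.ids) < target:
    room = target - len(c.ids)   # vacancies at the controller
    for _ in range(room):
        release_hold()           # in practice, also track jobs released but not yet connected
    time.sleep(60)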

