[IPython-User] IPython.parallel over Infiniband?
Jon Olav Vik
jonovik@gmail....
Fri Sep 7 06:21:54 CDT 2012
Short version: Can the IPython.parallel ipcontroller and ipengines use
Infiniband for communication?
Background:
As mentioned in previous posts, I use IPython.parallel on a shared batch
cluster, where I submit ipengines as relatively short "batch jobs" for use by a
Client.load_balanced_view(retries=..., chunksize=..., ordered=False). This
gives me load-balanced, fault-tolerant (in particular if an engine job times
out) computing of otherwise trivially parallel tasks. This is by far the most
maintainable framework I've found, and it scales well to at least 100
processors, or more than 600 if I use several clusters. The limiting factor seems to be
the number and latency of TCP connections.
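
For concreteness, here is a minimal sketch of the usage pattern described above. It assumes the IPython 0.13-era IPython.parallel API, where retries is set as a view flag and chunksize and ordered are passed to map(). The helper name run_unordered and the default values are illustrative assumptions, not part of the library.

```python
def run_unordered(view, func, inputs, retries=3, chunksize=10):
    """Map func over inputs on a load-balanced view, yielding results as they finish.

    `view` is expected to behave like IPython.parallel's LoadBalancedView.
    The helper name and the default values are illustrative assumptions.
    """
    # Resubmit a failed task up to `retries` times, e.g. when the batch job
    # hosting its engine times out.
    view.set_flags(retries=retries)
    # chunksize groups several inputs into one task to cut per-task overhead;
    # ordered=False hands back results in completion order, not input order.
    async_result = view.map(func, inputs, block=False,
                            chunksize=chunksize, ordered=False)
    for result in async_result:
        yield result


# With a running controller and engines (not started by this snippet):
#     from IPython.parallel import Client
#     view = Client().load_balanced_view()
#     for y in run_unordered(view, my_function, my_inputs):
#         ...
```

Because results arrive in completion order, each result would need to carry enough information to identify which input it belongs to.
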
I was recently kicked off a batch cluster for failing to utilize their
precious Infiniband, and for *possibly* competing with the batch system's use
of TCP. (This was not investigated further, as that cluster was intended to
serve needs other than my rather trivially parallel computing, and so they
didn't really want me around anyway.)
Now, I know next to nothing about what Infiniband is, but googling suggested
that TCP can be run over Infiniband.
http://pkg-ofed.alioth.debian.org/howto/infiniband-howto-5.html
I wonder if that could improve the latency of IPython.parallel tasks, while
letting me be less of a nuisance to the batch cluster admins. Any hints on
whether and how this can be achieved would be most appreciated.
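
One way this might be tried, assuming the cluster exposes Infiniband as an ordinary IP interface (IP over Infiniband; the interface name ib0 is an assumption here): look up that interface's IPv4 address and have ipcontroller listen on it through its --ip option, so that engines reading the controller's connection file would connect to that address. The helper names below are hypothetical, the lookup is Linux-only, and whether this actually reduces latency is untested.

```python
import fcntl
import ipaddress
import socket
import struct

SIOCGIFADDR = 0x8915  # Linux ioctl request: get an interface's IPv4 address


def interface_ipv4(ifname):
    """Return the IPv4 address of a network interface, or None if unavailable."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            # The kernel expects the interface name at the start of an ifreq struct.
            request = struct.pack("256s", ifname.encode()[:15])
            reply = fcntl.ioctl(sock.fileno(), SIOCGIFADDR, request)
    except OSError:
        return None  # no such interface, or no IPv4 address assigned
    return socket.inet_ntoa(reply[20:24])


def controller_command(address):
    """Build an ipcontroller command line that listens on the given address."""
    ipaddress.ip_address(address)  # raises ValueError if not a valid IP address
    return ["ipcontroller", "--ip=" + address]


if __name__ == "__main__":
    addr = interface_ipv4("ib0")  # "ib0" is an assumed interface name
    if addr is None:
        print("No IPv4 address found on ib0; is IP over Infiniband configured?")
    else:
        print(" ".join(controller_command(addr)))
```

The script only prints the command rather than running it, so the address can be checked with the cluster admins first.
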
(I have mostly heard about Infiniband in connection with MPI. However, MPI
doesn't seem to fit my needs because 1) all MPI processes need to start and
stop at the same time, whereas I wish to use as many processors as happen to be
available, without specifying the number in advance, 2) the ipcluster cannot
use MPI for coordination, and 3) I wish to distribute tasks and results using a
load_balanced_view() and not explicitly over MPI.)
Best regards,
Jon Olav