[IPython-User] Parallel question: Sending data directly between engines
Olivier Grisel
olivier.grisel@ensta....
Sun Jan 8 06:26:40 CST 2012
AFAIK the traditional way to implement the AllReduce is the to first a
spanning tree over the nodes / engines. For instance if you have 10
nodes, define a fixed arbitrary binary tree that spans all the nodes
involved in the computation:
0
1
3
7
8
4
9
2
5
6
0 is the root and has 1 and 2 has children. 1 has parent 0 and as 3
and 4 as children and so on.
Each engine is only aware of his parent and 2 direct children. When a
node computation reaches a AllReduce barrier it waits for his to
children to send him the partial results then compute the aggregate
with his own internal state and ship the result to his parent. Leaf
nodes start first without waiting at all (as they know they don't have
any children to wait for). When the root is reached the final result
is recursively broadcasted to all the children.
This spanning tree strategy ensures that a single node node mailbox
will never receive more that 2 messages at once. This is very
important to scale to large clusters (e.g. 1000 nodes) since if you
have many incoming messages of a couple of megabytes you might
saturate the network interface of a single node and potentially it's
memory buffers if the messages are not consumed in a streamed manner.
However as far as I understand IPython.parallel might not be designed
to address 1000+ nodes clusters and the saturation problem probably
does not occur before hundreds of nodes if the messages are not to
big. Still I think it would be good for the IPython.parallel project
to implement such primitives and make some benchmarks to be sure
whether the spanning tree is useful or note. Esp. checking the impact
of the arity of the tree and message size on the overall cluster
performance.
Note that the AllReduce scheme implemented with the spanning tree
strategy impose the aggregation function to be commutative and
distributive. It might not be the case if you implement the naive
gather / reduce / broadcast strategy where you can reorder the partial
data before performing the reduce.
--
Olivier
More information about the IPython-User
mailing list