[IPython-User] Parallel question: Sending data directly between engines
Sun Jan 8 13:12:51 CST 2012
2012/1/8 Fernando Perez <firstname.lastname@example.org>:
> Hi Olivier,
> On Sun, Jan 8, 2012 at 4:26 AM, Olivier Grisel <email@example.com> wrote:
>> AFAIK the traditional way to implement AllReduce is to first build a
>> spanning tree over the nodes / engines. For instance if you have 10
>> nodes, define a fixed arbitrary binary tree that spans all the nodes
>> involved in the computation:
> Thanks for the explanation. Indeed, we don't have communication
> patterns with other topologies implemented yet out of the box, so it's
> good to have this use case described well for us. It seems like we
> should be able to use the star-topology for now, which works out of
> the box (but has the limitation you point out), and implementing other
> communication patterns such as a spanning tree one should be easy.
> Do you know how they handle node failure? What happens to a node's
> children when it disappears?
I don't know, as I am familiar with neither the implementations of MPI
runtimes nor the inner workings of vowpal wabbit and its Hadoop
AllReduce integration. I would guess there is some kind of external
monitoring process that can detect those failures and dynamically
rewire the connected nodes to new engines or to one another.
A similar strategy could be used to detect, remove and reallocate
"slow nodes" (e.g. failing hard-drives, swapping memory, overloaded
machine when it is shared with other applications...) so as not to
slow down the whole computation.
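
To make the idea concrete, here is a small in-process sketch of a
tree-based AllReduce over 10 nodes, plus one possible way to rewire the
tree when a node is dropped. This is only an illustration under my own
assumptions: the names (binary_tree, allreduce, drop_node) are
hypothetical, it does not use the IPython.parallel API, and it is not how
MPI or vowpal wabbit actually do it. The "rewire to the grandparent" rule
is just my guess from above turned into code.

```python
def binary_tree(n):
    """Map each node to its parent in a fixed binary tree; node 0 is the root."""
    return {i: (i - 1) // 2 if i > 0 else None for i in range(n)}


def children_of(parent):
    """Invert a {node: parent} mapping into {node: [children]}."""
    kids = {node: [] for node in parent}
    for node, p in parent.items():
        if p is not None:
            kids[p].append(node)
    return kids


def allreduce(values, parent, op=lambda a, b: a + b):
    """Reduce the values up the tree, then broadcast the result back down."""
    kids = children_of(parent)
    root = next(node for node, p in parent.items() if p is None)

    def reduce_up(node):
        # Each node combines its own value with its children's partial results.
        total = values[node]
        for child in kids[node]:
            total = op(total, reduce_up(child))
        return total

    results = {}

    def broadcast_down(node, result):
        # Each node stores the final result and forwards it to its children.
        results[node] = result
        for child in kids[node]:
            broadcast_down(child, result)

    broadcast_down(root, reduce_up(root))
    return results


def drop_node(parent, failed):
    """Hypothetical rewiring: attach the failed node's children to its parent."""
    if parent[failed] is None:
        raise ValueError("losing the root needs a different strategy")
    return {
        node: (parent[failed] if p == failed else p)
        for node, p in parent.items()
        if node != failed
    }


tree = binary_tree(10)
values = {node: node + 1 for node in tree}   # node i holds the value i + 1
totals = allreduce(values, tree)             # every node ends up with the sum

smaller_tree = drop_node(tree, 1)            # nodes 3 and 4 now report to node 0
totals_after = allreduce(values, smaller_tree)
```

In a real setup each reduce_up / broadcast_down step would be a message
between a node and its direct neighbours in the tree, so no single node
has to talk to all the others as in the star topology.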
http://twitter.com/ogrisel - http://github.com/ogrisel