[IPython-User] Parallel question: Sending data directly between engines

Olivier Grisel olivier.grisel@ensta....
Sun Jan 8 13:12:51 CST 2012


2012/1/8 Fernando Perez <fperez.net@gmail.com>:
> Hi Olivier,
>
> On Sun, Jan 8, 2012 at 4:26 AM, Olivier Grisel <olivier.grisel@ensta.org> wrote:
>> AFAIK the traditional way to implement AllReduce is to first build a
>> spanning tree over the nodes / engines. For instance, if you have 10
>> nodes, define a fixed arbitrary binary tree that spans all the nodes
>> involved in the computation:
>>
>> 0
>
> Thanks for the explanation.  Indeed, we don't have communication
> patterns with other topologies implemented out of the box yet, so it's
> good to have this use case described well for us.  It seems like we
> should be able to use the star topology for now, which works out of
> the box (but has the limitation you point out), and implementing other
> communication patterns such as a spanning-tree one should be easy.
>
> Do you know how they handle node failure?  What happens to its
> children when a node disappears?

I don't know, as I am not familiar with the implementation of MPI
runtimes or with the inner workings of Vowpal Wabbit and its Hadoop
AllReduce integration. I would guess there is some kind of external
monitoring process that detects such failures and dynamically rewires
the affected nodes to new engines or to one another.

A similar strategy could be used to detect "slow nodes" (e.g. failing
hard drives, swapping memory, or machines overloaded because they are
shared with other applications) and remove them or reallocate their
work, so as not to slow down the whole computation.
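
To make the pattern concrete, here is a rough, untested sketch of the
idea: an implicit binary tree over engine ids 0..n-1, a reduce pass
towards the root followed by a broadcast back down, plus a toy helper
showing how the children of a failed node could be reattached to its
parent. All names are made up for illustration and the "messages" are
plain in-process calls; a real implementation would send them directly
between engines (e.g. over ZeroMQ or MPI).

def children(i, n):
    """Children of node i in an implicit binary tree over ids 0..n-1."""
    return [c for c in (2 * i + 1, 2 * i + 2) if c < n]

def tree_allreduce(local_values, op=lambda a, b: a + b):
    """AllReduce of one value per node using the implicit binary tree."""
    n = len(local_values)

    def reduce_up(i):
        # Each node combines its own value with its children's partial
        # results and forwards the sum to its parent.
        acc = local_values[i]
        for c in children(i, n):
            acc = op(acc, reduce_up(c))
        return acc

    total = reduce_up(0)   # the root (node 0) now holds the global result
    return [total] * n     # broadcast it back down: every node agrees

def reparent(failed, n):
    """Toy rewiring: attach the children of a failed non-root node
    directly to that node's parent."""
    parent = (failed - 1) // 2
    return dict((c, parent) for c in children(failed, n))

if __name__ == '__main__':
    partials = [float(i) for i in range(10)]   # 10 "engines"
    print(tree_allreduce(partials))            # every node ends up with 45.0
    print(reparent(4, 10))                     # {9: 1}: node 9 reattaches to node 1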

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

