[IPython-User] Parallel question: Sending data directly between engines

Fernando Perez fperez.net@gmail....
Sun Jan 8 13:15:40 CST 2012

On Sun, Jan 8, 2012 at 11:12 AM, Olivier Grisel
<olivier.grisel@ensta.org> wrote:
> I don't know as I am not familiar with the implementations of MPI
> runtimes nor the inner workings of vowpal wabbit and its Hadoop
> AllReduce integration. I would guess there is some kind of external
> monitoring process that can detect those failures and dynamically
> rewire the connected nodes to new engines or to one another.

OK, thanks.  This is a very useful discussion, thanks for taking the
time to educate me on the matter!

> A similar strategy could be used to detect, remove and reallocate
> "slow nodes" (e.g. failing hard-drives, swapping memory, overloaded
> machine when it is shared with other applications...) so as not to
> slow down the whole computation.

Yup, interesting... Indeed we have all the necessary information.
I'm pretty sure Min added a fair amount of timing data to the internal
metadata tracked by the controller, so something like this is
definitely doable.
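
As a rough sketch of what that could look like (not anything that exists
in IPython today): assume each finished task leaves behind a record with
`engine_id`, `started` and `completed` fields, shaped like the per-task
metadata dicts. The function name `find_slow_engines` and the `factor`
threshold are hypothetical, made up for illustration. It flags engines
whose median task time is well above the median across all engines:

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import median


def find_slow_engines(records, factor=2.0):
    """Return sorted ids of engines whose median task time exceeds
    `factor` times the median of all per-engine medians.

    `records` is an iterable of dicts with 'engine_id', 'started' and
    'completed' keys (the latter two being datetime objects). This
    record layout is an assumption of the sketch.
    """
    per_engine = defaultdict(list)
    for rec in records:
        elapsed = (rec["completed"] - rec["started"]).total_seconds()
        per_engine[rec["engine_id"]].append(elapsed)

    # One representative duration per engine, robust to a few outliers.
    engine_medians = {eid: median(times) for eid, times in per_engine.items()}
    if not engine_medians:
        return []

    # Compare each engine against the typical engine, not the typical task,
    # so an engine that completed many tasks does not dominate the baseline.
    baseline = median(engine_medians.values())
    return sorted(eid for eid, m in engine_medians.items()
                  if m > factor * baseline)
```

The records could presumably be collected from the metadata of completed
results, and the flagged engine ids then excluded from the set of targets
for later submissions. How to actually reallocate work away from those
engines is a separate question that this sketch does not address.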


