[IPython-User] Parallel question: Sending data directly between engines
Sun Jan 8 13:15:40 CST 2012
On Sun, Jan 8, 2012 at 11:12 AM, Olivier Grisel
> I don't know as I am not familiar with the implementations of MPI
> runtimes nor the inner workings of vowpal wabbit and its Hadoop
> AllReduce integration. I would guess there is some kind of external
> monitoring process that can detect those failures and dynamically
> rewire the connected nodes to new engines or to one another.
OK, thanks. This is a very useful discussion, thanks for taking the
time to educate me on the matter!
> A similar strategy could be used to detect, remove and reallocate
> "slow nodes" (e.g. failing hard-drives, swapping memory, overloaded
> machine when it is shared with other applications...) so as not to
> slow down the whole computation.
Yup, interesting... Indeed we have all the necessary information.
I'm pretty sure Min added a fair amount of timing data to the internal
metadata tracked by the controller, so something like this is
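
To make the slow-node idea concrete, here is a rough sketch of how per-task
timing records could be used to flag engines that run much slower than
their peers. The record layout (engine_id, started, completed), the
find_slow_engines helper and the 2x threshold are all assumptions made up
for illustration; this is not the controller's actual metadata schema or
any existing IPython API.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import median


def find_slow_engines(records, factor=2.0):
    """Return sorted ids of engines whose median task runtime exceeds
    `factor` times the median of all engines' median runtimes.

    `records` is a hypothetical list of dicts with keys
    'engine_id', 'started' and 'completed' (datetime objects).
    """
    runtimes = defaultdict(list)
    for rec in records:
        elapsed = (rec["completed"] - rec["started"]).total_seconds()
        runtimes[rec["engine_id"]].append(elapsed)

    # One representative runtime per engine, robust to a single outlier task.
    per_engine = {eid: median(times) for eid, times in runtimes.items()}
    if not per_engine:
        return []

    # Baseline is the median across engines, so one slow engine
    # cannot drag the reference point up very far.
    baseline = median(per_engine.values())
    return sorted(eid for eid, t in per_engine.items() if t > factor * baseline)


# Made-up timing records: engines 0-2 take about 1 s per task, engine 3 about 5 s.
t0 = datetime(2012, 1, 8, 13, 0, 0)
records = [
    {"engine_id": eid, "started": t0, "completed": t0 + timedelta(seconds=secs)}
    for eid, secs in [(0, 1.0), (0, 1.2), (1, 0.9), (1, 1.1),
                      (2, 1.0), (3, 5.0), (3, 5.5)]
]
slow = find_slow_engines(records)
print(slow)
```

Medians are used rather than means so that a single long-running task does
not get an otherwise healthy engine flagged. What to do with a flagged
engine (stop sending it work, restart it, reassign its tasks) is a separate
policy question that this sketch does not address.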