[IPython-User] Parallel question: Sending data directly between engines

Olivier Grisel olivier.grisel@ensta....
Fri Jan 6 23:58:13 CST 2012


Very interesting thread, thanks for sharing. Let me describe some of my
related use cases, to feed the IPython.parallel community with some
additional use cases you might not be aware of.

In machine learning there is an MPI construct, AllReduce, that is very
useful and that it would be great to be able to implement efficiently
with IPython / zmq.
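To make the construct concrete, here is a minimal in-process sketch of AllReduce semantics (not the IPython/zmq implementation being asked for; the `allreduce` function and the list-of-lists "engine" representation are made up for illustration): every participant contributes a vector, and every participant receives the element-wise reduction.

```python
# Hypothetical sketch of AllReduce semantics, with each "engine" holding
# a plain Python list as its local vector.
def allreduce(vectors, op=sum):
    """Reduce the i-th element across all vectors; give the result to all."""
    reduced = [op(col) for col in zip(*vectors)]
    return [list(reduced) for _ in vectors]  # one copy per participant

local = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # one vector per engine
synced = allreduce(local)
# every engine now holds the same reduced vector: [9.0, 12.0]
```

A real implementation would of course exchange these vectors over the network (e.g. via a tree of zmq sockets) rather than in-process.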

Indeed, some efficient large scale / "online" machine learning models
scan the training data and incrementally update their parameters every
now and then (e.g. at every row, or at every mini-batch of rows). Often
this update can be expressed as a weighted average of the previous
parameter values and the information (the gradient of the error)
collected on the current rows. This is the case, for instance, for
Averaged Stochastic Gradient Descent for classification and for online
k-means for clustering.
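The weighted-average update described above can be sketched as follows (the `update` function, the learning rate `eta` and the toy "gradient information" values are all assumptions made for illustration, not the actual update rule of any specific model):

```python
# Hypothetical sketch: new parameters are a weighted average of the old
# parameter values and the information collected on the current row
# (here a stand-in for a gradient estimate).
def update(params, gradient_info, eta=0.1):
    return [(1.0 - eta) * p + eta * g for p, g in zip(params, gradient_info)]

params = [0.0, 0.0]
for row_gradient in ([1.0, 0.0], [0.0, 1.0]):
    params = update(params, row_gradient, eta=0.5)
# params has drifted toward the observed gradient information: [0.25, 0.5]
```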

If we are to adapt such machine learning algorithms to run in a
distributed setting, the natural way would be to scan the
pre-partitioned dataset in parallel engines, with each engine having a
local copy of the model parameters that can diverge a bit from one
engine to another. The initial parameter values are broadcast from a
common init. The model parameters are then synchronized from time to
time by computing a cluster-global weighted average of all the engines'
parameters, which is then re-broadcast to all the engines to be applied
locally, up until the next synchronization barrier. Depending on the
algorithm, it might also be possible to do the AllReduce averaging
asynchronously, without stopping the local stochastic gradient descent
process, and so remove the global synchronization barrier.
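The synchronization scheme above can be sketched with the cluster simulated in-process (the `local_steps` drift, the engine count, and the uniform weights are all made up for the example; a real run would do local SGD on each engine and exchange the averages over the network):

```python
# Sketch of periodic parameter synchronization: each simulated engine
# takes local steps that let its parameters drift, then a cluster-global
# weighted average is computed and "re-broadcast" to every engine.
def local_steps(params, engine_id):
    # Stand-in for local stochastic gradient descent: deterministic
    # per-engine drift so the example is reproducible.
    return [p + 0.01 * engine_id for p in params]

def synchronize(all_params, weights):
    total = sum(weights)
    avg = [sum(w * p for w, p in zip(weights, col)) / total
           for col in zip(*all_params)]
    return [list(avg) for _ in all_params]  # re-broadcast to every engine

engines = [[0.0, 0.0] for _ in range(4)]           # broadcast common init
engines = [local_steps(p, i) for i, p in enumerate(engines)]
engines = synchronize(engines, weights=[1.0] * 4)  # synchronization barrier
# all engines now hold the same averaged parameters
```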

The parameter vector is expected to fit in memory on each node of the
cluster and should be (at least partially) exchanged many times across
the cluster nodes during the learning process. A typical size for the
model parameters (the payload to synchronize from time to time) would
be, for instance, 1e5 * 8 bytes = 0.8 MB for a single linear binary
classifier of text vectors randomly projected into a 1e5-dimensional
space of double-precision values (although single precision would
probably be more than enough here).

The size of the input data could be arbitrarily big, but that is not as
important: the dataset never needs to be loaded into memory on any
single node, and it can be partitioned so that shards are shipped to
each node of the cluster before the learning actually starts. In the
case of Hadoop, for instance, the sharded dataset is stored directly on
the computational cluster and actually stays there to leverage data
locality.

More info in this blog post: http://hunch.net/?p=2094 ("Hadoop
AllReduce and Terascale Learning")

-- 
Olivier
