[IPython-User] FOSS sponsorship and tutorial at PyCon

Olivier Grisel olivier.grisel@ensta....
Fri Jan 13 02:38:20 CST 2012

2012/1/13 Olivier Grisel <olivier.grisel@ensta.org>:
> Indeed, but I would rather say experimenting with building blocks for
> distributed data analytics and machine learning on a medium-sized
> cluster.
> I am not convinced at all that Hadoop-style MapReduce is the best
> setting for machine learning. Having to decompose your algorithm into
> mappers and reducers is a very hard constraint that makes your code
> base very complicated and sometimes completely inefficient, especially
> for iterative machine learning, where you would have to encode both
> your data and the estimated parameters as a series of key-value pairs.
> The only justification for adopting the constraints imposed by the
> MapReduce programming model is that, when combined with a distributed
> replicated filesystem such as the Google File System or Hadoop's HDFS,
> you can build a cluster runtime that makes your computation
> fault-tolerant at large cluster scales: some machine is bound to fail
> during a day-long computation on 1000 machines with 1000 hard drives.
> I don't really want to focus on this use case. There are many
> interesting problems that can be solved with 10-100 machines over a
> couple of hours, where the probability of failure is much lower and
> fault tolerance is not the main objective.
> On the other hand: data locality, message passing, MPI-style AllReduce
> implemented with a spanning tree, efficient broadcasting of data and
> parameters, partitioned memory-mapped data structures, in-memory
> distributed shuffle (to implement joins as done in pandas)... all
> sound like very interesting building blocks to experiment with in a
> sprint.
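To make the "hard constraint" point above concrete, here is a minimal single-process sketch (not real Hadoop code; the `mapper`/`reducer` helpers and the toy dataset are made up for illustration) of one gradient-descent step phrased as MapReduce-style key/value pairs. Note how the full parameter vector has to be shipped to every mapper and the gradients re-emitted as `(param_index, value)` pairs on every single iteration, i.e. one full MapReduce job per step:

```python
from collections import defaultdict

def mapper(partition, params):
    # Each mapper sees one data partition plus a full copy of the
    # current parameters, and emits (param_index, partial_gradient)
    # pairs for a least-squares linear model.
    for x, y in partition:
        pred = sum(p * xi for p, xi in zip(params, x))
        err = pred - y
        for i, xi in enumerate(x):
            yield i, err * xi

def reducer(pairs):
    # Sum the partial gradients per parameter index.
    grads = defaultdict(float)
    for i, g in pairs:
        grads[i] += g
    return grads

# Two "partitions" of a toy consistent dataset (w = (1, 2) fits exactly).
partitions = [[((1.0, 2.0), 5.0)], [((0.5, 1.0), 2.5)]]
params = [0.0, 0.0]
for _ in range(10):  # one complete MapReduce job per iteration
    pairs = [kv for part in partitions for kv in mapper(part, params)]
    grads = reducer(pairs)
    params = [p - 0.1 * grads[i] for i, p in enumerate(params)]
```

Even in this toy form, the iteration loop lives outside the map/reduce machinery, which is exactly the mismatch: the framework gives you fault tolerance per job, but the algorithm wants cheap, stateful iterations.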
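And here is a rough single-process sketch of the tree-structured AllReduce idea (an illustration of the pattern, not an MPI binding): partial sums flow from the leaves of a binary spanning tree up to the root, then the global result is broadcast back down, so every node ends up with the total in O(log n) communication rounds instead of all-to-all traffic:

```python
def allreduce_sum(values):
    # Simulate n nodes laid out as an implicit binary tree:
    # node i has children 2*i+1 and 2*i+2.
    n = len(values)
    totals = list(values)
    # Reduce phase: each child sends its partial sum to its parent.
    for i in reversed(range(n)):
        for child in (2 * i + 1, 2 * i + 2):
            if child < n:
                totals[i] += totals[child]
    # Broadcast phase: the root's total flows back down the tree,
    # so every node holds the global sum.
    return [totals[0]] * n

print(allreduce_sum([1, 2, 3, 4]))  # every node sees the global sum 10
```

For iterative learning this is attractive because each worker can keep its data partition in memory across iterations and only exchange gradient vectors through the tree, instead of re-serializing everything as key/value pairs.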

By the way here is the sprint wikipage:


http://twitter.com/ogrisel - http://github.com/ogrisel

More information about the IPython-User mailing list