[IPython-User] FOSS sponsorship and tutorial at PyCon

Olivier Grisel olivier.grisel@ensta....
Fri Jan 13 02:36:40 CST 2012


2012/1/13 Gael Varoquaux <gael.varoquaux@normalesup.org>:
> On Thu, Jan 12, 2012 at 03:46:41PM -0800, Fernando Perez wrote:
>> Yes, it's great!  And btw, the sprinting will be mostly in
>> collaboration with scikits-learn: at least Olivier Grisel and Jacob
>> VanderPlas will be around,
>
> Awesome. I didn't know that Jake would be there. He is a great guy.
>
> I gather that the goal of the sprint would be much more general than
> scikit-learn itself. A real map-reduce framework using IPython would
> greatly benefit not only scikit-learn but also any large-scale data
> processing application in Python.
>
> By map-reduce, I mean a black-box framework that takes mappers, reducers,
> and data, and knows how to spread the tasks optimally on a network given
> its topology and the data interdependencies between the tasks. Last time
> I discussed this with Olivier, I got the impression that he had these
> kinds of goals in mind for the sprint. He knows these architectures
> really well.

Indeed, though I would rather describe it as experimenting with
building blocks for distributed data analytics and machine learning on
a medium-sized cluster:

I am not convinced at all that Hadoop-style MapReduce is the best
setting for machine learning. Having to decompose your algorithm into
mappers and reducers is a very hard constraint that makes your code
base very complicated and sometimes completely inefficient, especially
for iterative machine learning, where you have to encode both your
data and the estimated parameters as a series of key/value pairs.
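To make the constraint concrete, here is a minimal single-process
sketch (pure Python, no cluster; the toy one-parameter least-squares
model and all names are illustrative only) of one gradient-descent
loop forced through the mapper/reducer contract:

    from itertools import groupby
    from operator import itemgetter

    def mapper(record, params):
        # each record is (x, y); emit per-sample statistics for a toy
        # one-parameter least-squares model y ~ params * x
        x, y = record
        yield ("grad", 2 * (params * x - y) * x)
        yield ("count", 1)

    def reducer(key, values):
        # fold all values that share a key
        yield (key, sum(values))

    def map_reduce(records, params):
        # simulate the shuffle: sort intermediate pairs, group by key
        intermediate = sorted(
            (kv for r in records for kv in mapper(r, params)),
            key=itemgetter(0))
        return dict(
            kv
            for key, group in groupby(intermediate, key=itemgetter(0))
            for kv in reducer(key, (v for _, v in group)))

    data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]
    params = 0.0
    for _ in range(50):
        out = map_reduce(data, params)
        params -= 0.1 * out["grad"] / out["count"]

Even in this toy the pain is visible: the estimated parameters have to
be re-shipped to every mapper at every iteration, and all state
between iterations must be funneled through key/value pairs and a full
shuffle.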

The only justification for adopting the constraints imposed by the
MapReduce programming model is that, when combined with a distributed
replicated filesystem such as the Google File System or Hadoop's HDFS,
you can build a cluster runtime that makes your computation
fault-tolerant at large cluster scales: some machine is bound to fail
during a day-long computation on 1000 machines with 1000 hard drives.
I don't really want to focus on this use case. There are many
interesting problems that can be solved with 10-100 machines over a
couple of hours, where the probability of failure is much lower and
fault tolerance is not the main objective.
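A rough back-of-envelope calculation (assuming, purely for
illustration, an average of one hardware failure per machine per year,
failures independent) shows how different the two regimes are:

    # expected number of machine failures during a job
    failure_rate = 1.0 / (365 * 24)     # failures per machine-hour

    print(1000 * 24 * failure_rate)     # 1000 machines, 24h: ~2.7
    print(50 * 2 * failure_rate)        # 50 machines, 2h: ~0.011

At the first scale fault tolerance is mandatory; at the second, simply
re-running the occasional failed job is usually cheaper than paying
the MapReduce programming tax everywhere.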

On the other hand: data locality, message passing, MPI-style AllReduce
implemented with a spanning tree, efficient broadcasting of data and
parameters, partitioned memory-mapped data structures, in-memory
distributed shuffle (to implement joins, as done in pandas)... these
all sound like very interesting building blocks to experiment with in
a sprint.
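For instance, here is a toy single-process simulation of what an
MPI-style AllReduce over a binary spanning tree does (the layout and
names are illustrative, not any particular library's API): partial
results flow up the tree to the root, and the combined result is
broadcast back down, so every node ends up with the global value in
O(log n) communication steps instead of all-to-all exchanges:

    def tree_allreduce(values, combine):
        # values[i] is node i's local partial result; node i's
        # children in the implicit binary tree are 2*i+1 and 2*i+2
        n = len(values)
        acc = list(values)
        # reduce phase: combine children into parents, leaves first
        for i in reversed(range(n)):
            for child in (2 * i + 1, 2 * i + 2):
                if child < n:
                    acc[i] = combine(acc[i], acc[child])
        # broadcast phase: the root's result flows back down the tree
        return [acc[0]] * n

    # e.g. each node holds a partial gradient sum; after the allreduce
    # every node knows the global sum and can update its own copy of
    # the parameters locally
    print(tree_allreduce([1, 2, 3, 4, 5], lambda a, b: a + b))
    # -> [15, 15, 15, 15, 15]

This is the kind of primitive that makes iterative algorithms pleasant
to write on a cluster, with no detour through key/value pairs.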

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

