[SciPy-dev] Google Summer of Code and scipy.learn (another trying)

David Cournapeau david@ar.media.kyoto-u.ac...
Mon Mar 24 05:41:24 CDT 2008

Anton Slesarev wrote:
> If you don't care about the format, users will have to do it.

Not really. I think having one dataset format usable by every learning 
algorithm is impossible. The common format, internally, is the array: a 
numpy array, a sparse matrix, etc... What people do care about is the 
interface to the dataset, not the format of the dataset itself: they 
want to pre-process it, select some attributes, etc...

Nothing prevents anyone from implementing some "proxies" to convert one 
common dataset format into a bunch of numpy arrays. Did you look at the 
dataset proposal? It does provide some examples. Also, there is a proof 
of concept that you can implement attribute selection, whitening 
and normalization on top of it.
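To make the idea concrete, here is a minimal sketch of what such pre-processing tools could look like when they operate on plain numpy arrays (the function names are hypothetical, not from the actual proposal):

```python
import numpy as np

def normalize(data):
    """Scale each attribute (column) to zero mean and unit variance."""
    return (data - data.mean(axis=0)) / data.std(axis=0)

def select_attributes(data, indices):
    """Keep only the attribute columns listed in indices."""
    return data[:, indices]

def whiten(data):
    """Decorrelate the attributes (PCA whitening): after the transform,
    the sample covariance matrix is the identity."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    return centered @ eigvecs / np.sqrt(eigvals)
```

Because each step takes an array and returns an array, any learning package that can consume numpy arrays gets these tools for free.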

Again, the current state is quite messy, but I really hope to clean up 
the mess in a few weeks, when I go to the US. Right now, I have one 
conference talk to prepare and two articles to finish in the next few 
days, so I can't do it now; I am sorry for that. I think cleaning it up 
would make clearer what I had in mind.

> The svm implementation in scikits has a few strange dataset classes 
> (train dataset, test dataset, etc.). My idea is that each classifier 
> should use common dataset classes. That would make common 
> preprocessing tools possible.

svm has the "strange" dataset classes imposed by libsvm. That is 
typically one example of why I don't think you will be able to design a 
format usable by everything: depending on the existing implementation, 
there are already incompatible dataset classes.

I agree that common pre-processing tools should be available. The way I 
saw it was to do all this with pure numpy arrays, and then use some 
proxies to convert back and forth between different packages. Each 
learning package does its own thing, but should have a way to convert 
between its own format and a common interface, and the common tools deal 
with the common interface.
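A proxy of this kind could be as simple as a pair of conversion functions. The sketch below uses a hypothetical libsvm-style record format (a label plus a sparse dict of non-zero attributes per sample) just to illustrate the round trip; the actual formats each package imposes will differ:

```python
import numpy as np

def to_libsvm_style(features, labels):
    """Convert dense numpy arrays to libsvm-style records:
    one (label, {attribute_index: value}) pair per sample."""
    records = []
    for row, label in zip(features, labels):
        records.append((label, {i: v for i, v in enumerate(row) if v != 0}))
    return records

def from_libsvm_style(records, n_features):
    """Convert libsvm-style records back to dense numpy arrays."""
    features = np.zeros((len(records), n_features))
    labels = np.empty(len(records))
    for k, (label, attrs) in enumerate(records):
        labels[k] = label
        for i, v in attrs.items():
            features[k, i] = v
    return features, labels
```

The common tools only ever see the numpy side; each package's proxy is responsible for its own representation.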

I think you should focus not on the format, but on the interfaces. 
Almost any "learning" algorithm will need the features, the labels, and 
maybe a representation of the labels. This covers supervised learning, 
clustering and regression, and is the only thing defined by the dataset 
proposal.
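In code, an interface carrying just those three things could look like the following sketch (class and attribute names are hypothetical, not taken from the proposal):

```python
import numpy as np

class Dataset:
    """Minimal dataset interface: a 2-D feature array, a 1-D label
    array, and an optional mapping from label codes to readable names."""

    def __init__(self, features, labels, label_names=None):
        self.features = np.asarray(features)
        self.labels = np.asarray(labels)
        self.label_names = label_names or {}

    def label_of(self, index):
        """Return the readable name of a sample's label, if one exists."""
        code = self.labels[index]
        return self.label_names.get(code, code)
```

For clustering, `labels` can simply be absent or ignored; nothing else in the interface needs to change.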

> I mean that I have 1 million text pages with 150 thousand different 
> words (features), but each document has only a small part of all 150 
> thousand words. And if I use a simple matrix it will be huge. But if 
> I use a sparse format such as the libsvm data format, then the input 
> file will be much smaller. I don't know how to do it with scikits now. 
> I know how to do it with libsvm and many other tools, but I want to 
> make scipy appropriate for this task. And I want to make a tutorial in 
> which one paragraph will be about "Sparse data".

I understand sparse data; what I don't understand is why you cannot use 
the existing scipy sparse implementation :)
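For the document/word case above, scipy.sparse already stores only the non-zero counts. A toy sketch with a hypothetical corpus of 4 documents and a 6-word vocabulary:

```python
import numpy as np
from scipy import sparse

# Hypothetical toy corpus: each (doc, word) pair records a word count.
rows = np.array([0, 0, 1, 2, 2, 3])    # document indices
cols = np.array([1, 4, 2, 1, 5, 0])    # word (feature) indices
counts = np.array([3, 1, 2, 1, 4, 2])  # occurrence counts

# CSR format stores only the 6 non-zero entries, not all 4 * 6 cells;
# with 1M documents and 150k words, the savings are enormous.
doc_term = sparse.csr_matrix((counts, (rows, cols)), shape=(4, 6))
```

The same `csr_matrix` can then be sliced by row (document) or converted to dense blocks on demand for algorithms that need it.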

> I get it, but I don't like java:)

I don't like it either, but it was an example of how the interface of 
spider could be used with foreign packages, weka in this case.

> And I think a lot of weka's functionality is not very difficult to 
> implement in scipy.

Well, nothing is really difficult, in a sense; it just needs time. But 
before we can provide everything weka can provide (and weka here is 
really just one example), I think a simple workflow where you can go 
back and forth between environments is nice. It means more people use 
scikits.learn, which means potentially more developers, etc...
