[SciPy-dev] Google Summer of Code and scipy.learn (another trying)
Mon Mar 24 05:41:24 CDT 2008
Anton Slesarev wrote:
> If you don't care about format users will have to do it.
Not really. I don't think one dataset format usable by every learning
algorithm is possible. The common format internally is arrays, as provided
by numpy, scipy.sparse, etc. What people care about is the
interface to the dataset, not the format of the dataset itself: they
want to pre-process it, select some attributes, etc...
Nothing prevents anyone from implementing some "proxies" to convert one
common dataset format into a bunch of numpy arrays. Did you look at the
dataset proposal? It provides some examples, as well as a proof of
concept showing that you can implement attribute selection, whitening
and normalization on top of it.
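To illustrate the point, here is a minimal sketch of what attribute selection and normalization look like when the dataset is just a plain numpy array. This is purely illustrative and is not taken from the dataset proposal itself:

```python
import numpy as np

# Hypothetical example: a dataset exposed as a 2-D numpy array,
# one row per sample, one column per attribute (feature).
data = np.array([[1.0, 200.0, 3.0],
                 [2.0, 180.0, 1.0],
                 [3.0, 220.0, 2.0]])

# Attribute selection: keep only columns 0 and 2.
selected = data[:, [0, 2]]

# Normalization: zero mean and unit variance per attribute.
normalized = (selected - selected.mean(axis=0)) / selected.std(axis=0)
```

Since every learning package can consume a numpy array, tools written at this level work for all of them.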
Again, the current state is quite messy, but I really hope to clean up
the mess in a few weeks, when I go to the US. Right now I have one
conference talk to prepare and two articles to finish within a few
days, so I can't do it now; I am sorry for that. I think cleaning it
up would make what I had in mind clearer.
> Svm implementation in scikits has a few strange dataset classes (train
> dataset, test dataset, etc). My idea is that each classifier should use
> common dataset classes. That would make common preprocessing tools possible.
svm has the "strange" dataset classes imposed by libsvm. That's
typically one example of why I don't think you will be able to design a
format usable by everything: depending on the existing implementation,
there are already incompatible dataset classes.
I agree that common pre-processing tools should be available. The way I
saw it was to do all of this with pure numpy arrays, and then use some
proxies to convert back and forth between the different packages. Each
learning package does its own thing, but should provide a way to convert
between its own format and a common interface, and the common tools deal
only with that common interface.
I think you should not focus on formats, but on interfaces. Almost any
"learning" algorithm will need the features, the labels, and maybe a
representation of the labels. This covers supervised learning, clustering
and regression, and it is the only thing defined by the dataset proposal.
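A minimal version of such an interface could look like the sketch below. The class and function names here are hypothetical, chosen only to show the idea of a common interface plus per-package proxies; they are not the actual API of the dataset proposal:

```python
import numpy as np

# Hypothetical common interface: features, labels, and an optional
# representation of the labels. Nothing more is assumed.
class Dataset:
    def __init__(self, features, labels, label_names=None):
        self.features = np.asarray(features)   # (n_samples, n_features)
        self.labels = np.asarray(labels)       # (n_samples,)
        self.label_names = label_names         # e.g. {0: "neg", 1: "pos"}

# A package-specific proxy converts the common interface into whatever
# the package wants internally, e.g. libsvm-style (label, vector) pairs.
def to_package_format(ds):
    return list(zip(ds.labels, ds.features))

ds = Dataset([[0.0, 1.0], [1.0, 0.0]], [0, 1], {0: "neg", 1: "pos"})
pairs = to_package_format(ds)
```

The common pre-processing tools would only ever see the `Dataset` side, while each scikit keeps its own internal representation behind a proxy like `to_package_format`.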
> I mean that I have 1 million text pages with 150 thousand different
> words (features), but each document contains only a small part of all 150
> thousand words. If I use a simple matrix it will be huge, but if
> I use a sparse format such as the libsvm data format, the input file will be
> much smaller. I don't know how to do this with scikits now. I know how
> to do it with libsvm and many other tools, but I want to make scipy
> appropriate for this task. And I want to write a tutorial in which one
> paragraph will be about "sparse data".
I understand sparse; what I don't understand is why you cannot use the
existing scipy implementation :)
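For the bag-of-words case described above, scipy.sparse already does the job. A small sketch (the toy numbers here are made up for illustration; in the real case the matrix would be ~1 million documents by 150 thousand words, which is hopeless as a dense array but cheap as a sparse one):

```python
import numpy as np
from scipy import sparse

# Toy bag-of-words data: each document touches only a few words,
# so we only store the nonzero (document, word, count) triples.
rows = np.array([0, 0, 1, 2, 2, 2])   # document indices
cols = np.array([3, 7, 1, 0, 3, 9])   # word (feature) indices
vals = np.array([2, 1, 5, 1, 1, 3])   # word counts

# CSR is a good format for row-wise access during learning.
X = sparse.csr_matrix((vals, (rows, cols)), shape=(3, 10))
```

Only the 6 nonzero entries are stored, no matter how wide the feature space is, which is exactly the property the libsvm file format exploits on disk.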
> I get it, but I don't like java:)
I don't like it either, but it was an example of how the interface of
spider could be used with foreign packages, weka here.
> And I think a lot of weka functionality is not very difficult to
> implement in scipy.
Well, nothing is really difficult, in a sense; it just needs time. But
before we can provide everything weka can provide (and weka here is
really just one example), I think a simple workflow where you can go back
and forth between environments would be nice. It means more people use
scikits.learn, which means potentially more developers, etc...