[SciPy-dev] Google Summer of Code and scipy.learn (another try)

Anton Slesarev slesarev.anton@gmail....
Mon Mar 24 08:30:25 CDT 2008


On Mon, Mar 24, 2008 at 1:41 PM, David Cournapeau
<david@ar.media.kyoto-u.ac.jp> wrote:

> Anton Slesarev wrote:
> > If you don't care about format users will have to do it.
>
> Not really. I think having one format of dataset usable by any learning
> algorithm is impossible. The common internal format is the array: a
> numpy array, a sparse matrix, etc... What people do care about is the
> interface to the dataset, not the format of the dataset itself: they
> want to pre-process them, select some attributes, etc...


Yes! We need a common dataset class (for all of scikits.learn) that can
load different data formats. That is approximately what I mean.
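
Something like this minimal sketch is what I have in mind (the class and
method names here are hypothetical, not anything that exists in
scikits.learn today):

    import numpy as np

    class Dataset(object):
        """Hypothetical common container: features plus optional labels."""
        def __init__(self, data, labels=None):
            self.data = data      # numpy array or scipy.sparse matrix
            self.labels = labels  # 1-d array of labels, or None

        @classmethod
        def from_csv(cls, path):
            # Dense numeric data; take the last column as the label.
            raw = np.loadtxt(path, delimiter=',')
            return cls(raw[:, :-1], raw[:, -1])

        def select_samples(self, indices):
            labels = None if self.labels is None else self.labels[indices]
            return Dataset(self.data[indices], labels)

Other loaders (from_arff, from_libsvm, ...) would plug in the same way.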

>
>
> Nothing prevents anyone from implementing some "proxies" to convert one
> common dataset format into a bunch of numpy arrays. Did you look at the
> dataset proposal? It does provide some examples. Also, there is a proof
> of concept that you can implement attribute selection, whitening and
> normalization on top of them.


I've read it, and it does not support sparse formats.

From the proposal:
        - selecting only a subset of all samples.
        - selecting only a subset of the attributes (only sepal length and
          width, for example).
        - selecting only the samples of a given class.
        - small summary of the dataset.
I need different attributes for different samples.
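
For example, in the libsvm sparse text format each line stores one sample
as "label index:value" pairs, so only the non-zero attributes are listed
and two samples can carry completely different attribute sets:

    +1 3:0.5 1042:1.0 99871:0.25
    -1 3:0.1 7:2.0

The dense equivalent would waste space on tens of thousands of zeros per
line.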



>
> >
> > The SVM implementation in scikits has a few strange dataset classes
> > (train dataset, test dataset, etc.). My idea is that each classifier
> > should use common dataset classes. That would make common
> > preprocessing tools possible.
>
> svm has the "strange" dataset classes imposed by libsvm. That's
> typically one example of why I don't think you will be able to design a
> format usable by everything. Depending on the existing implementation,
> there are already incompatible dataset classes.
>
> I agree that common pre-processing tools should be available. The way I
> saw it was to do all this with pure numpy arrays, and then use some
> proxies to convert back and forth between different packages. Each
> learning package does its own thing, should have a way to convert
> between its own format and a common interface, and the common tools deal
> with the common interface.
>
> I think you should not focus on format, but on interfaces. Almost any
> "learning" algorithm will need the features, the labels, and maybe a
> representation of the labels. This covers supervised learning, clustering
> and regression. That's the only thing defined by the dataset proposal.
>

Yes
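
As a rough illustration of that interface idea (everything here is a
made-up sketch, not existing scikits code): the common layer only promises
features and labels, and each package ships a small proxy such as

    import numpy as np

    def to_libsvm_samples(features, labels):
        # Hypothetical proxy: turn a dense (n_samples, n_features) array
        # into the list-of-dicts sparse representation that libsvm-style
        # Python wrappers typically accept, keeping only non-zero entries.
        samples = []
        for row in np.asarray(features):
            samples.append(dict((j + 1, float(v))  # libsvm indices are 1-based
                                for j, v in enumerate(row) if v != 0.0))
        return list(labels), samples

The reverse proxy would map the package's output back to plain numpy
arrays, so the common pre-processing tools never see the package format.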

>
> >
> > I mean that I have 1 million text pages with 150 thousand different
> > words (features), but each document contains only a small part of
> > those 150 thousand words. If I use a plain matrix it will be huge,
> > but if I use a sparse format such as the libsvm data format, the input
> > file will be much smaller. I don't know how to do this with scikits
> > now. I know how to do it with libsvm and many other tools, but I want
> > to make scipy suitable for this task. And I want to write a tutorial
> > in which one paragraph will be about "sparse data".
>
> I understand sparse, I don't understand why you cannot use existing
> scipy implementation :)
>

Because you write about learn.datasets, but you want me to use the scipy
implementation without the feature selection, common normalization and
other features you just described. I am saying that this should live in a
wrapper called dataset (maybe sparsedataset), and the user shouldn't have
to care what is inside it.

Yes, it can be implemented with scipy.sparse. But it should actually be
implemented; that's all I want to say.
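
A minimal sketch of what such a wrapper could do internally, assuming
libsvm-formatted input (the function name is made up):

    import numpy as np
    import scipy.sparse as sp

    def load_sparse_text(path):
        # Read libsvm-style lines ("label idx:val idx:val ...") into a
        # scipy.sparse CSR matrix plus a label vector.
        rows, cols, vals, labels = [], [], [], []
        for i, line in enumerate(open(path)):
            parts = line.split()
            labels.append(float(parts[0]))
            for item in parts[1:]:
                idx, val = item.split(':')
                rows.append(i)
                cols.append(int(idx) - 1)  # libsvm indices are 1-based
                vals.append(float(val))
        data = sp.csr_matrix((vals, (rows, cols)))
        return data, np.array(labels)

Feature selection, normalization and so on could then work on the CSR
matrix just as they would on a dense array.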

>
> >
> > I get it, but I don't like java:)
>
> I don't like it either, but it was an example of how the interface of
> spider could be used with foreign packages, weka here.
>
> > And I think a lot of weka's functionality is not very difficult to
> > implement in scipy.
>
> Well, nothing is really difficult, in a sense, it just needs time. But
> before we can provide everything weka can provide (and weka here is
> really one example), I think a simple workflow where you can go back and
> forth between environments is nice. It means more people use
> scikits.learn, which means potentially more developers, etc...
>

Maybe.

>
> cheers,
>
> David
> _______________________________________________
> Scipy-dev mailing list
> Scipy-dev@scipy.org
> http://projects.scipy.org/mailman/listinfo/scipy-dev
>


We are talking about the same thing. I just want to improve the current
dataset class; one of the improvements is to make it possible to work with
sparse data. The exact format of the sparse text files really doesn't
matter.

I understand your position about integrating weka or something like it
with scipy. Maybe it is a good idea, but to tell the truth it is hard for
me to imagine in detail. I want to integrate some separate libraries such
as libsvm, BBR and others, or rewrite them in Python where that is
possible.

-- 
Anton Slesarev