[SciPy-dev] Google Summer of Code and scipy.learn (another trying)
Mon Mar 24 05:09:44 CDT 2008
> The basic idea is to NOT care about format. Just provide basic
> conventions, so that any tool which needs a dataset can use what it
> wants through introspection. In particular, I do not see any problem wrt
> sparse data, but I have not thought a lot about it, so maybe I missed
If you don't care about format, users will have to care about it
themselves, and users don't like doing that. I don't want parsers for
every format in the world. To tell the truth, I don't know exactly what I
want yet, but in the near future I'll study this question and describe it
in detail in the proposal.
> > There is no common structure in the ML package. It has scattered
> > modules such as svm, em, ann, but no unifying idea.
> Here is how I see things: different people have different usages. Some
> people like to try many different learning algorithms (and need a common
> structure, general tools, etc.), some just want to focus on one
> algorithm. It is important to keep the different "learning"
> algorithms independent: scikits.machine.em should be usable
> independently, same for the other scikits.machine modules; that is,
> there should be one level at which you can just use the algorithms with
> straight numpy arrays (no dataset class or anything). Ideally, I want
> them to have a 100 % python implementation, so that they can be used for
> educational purposes. Of course, some C/whatever implementation can be
> possible too, but that should be a complement.
The SVM implementation in scikits has a few odd dataset classes (train
dataset, test dataset, etc.). My idea is that every classifier should use
common dataset classes; that makes common preprocessing tools possible.
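For example, a common dataset class could be little more than a thin
wrapper around numpy arrays, with train/test splitting as a method rather
than as separate classes. The names below (Dataset, split) are just a
sketch of what I have in mind, not the existing scikits API:

```python
import numpy as np

class Dataset:
    """A minimal common dataset container: plain numpy arrays plus
    optional labels, shared by all classifiers (hypothetical API)."""

    def __init__(self, samples, labels=None):
        self.samples = np.asarray(samples)  # shape (n_samples, n_features)
        self.labels = None if labels is None else np.asarray(labels)

    def split(self, n_train):
        """Split into train and test datasets instead of using
        separate train/test dataset classes."""
        train = Dataset(self.samples[:n_train],
                        None if self.labels is None else self.labels[:n_train])
        test = Dataset(self.samples[n_train:],
                       None if self.labels is None else self.labels[n_train:])
        return train, test

data = Dataset([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]], labels=[0, 1, 1])
train, test = data.split(2)
```

Any preprocessing tool would then only need to understand this one
container, regardless of which classifier consumes it afterwards.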
> So the main idea is to have pre-processing tools, and other things
> common to most algorithms on one side, and the actual learning
> algorithms on another side.
> There is already some code to scale data (handling nan if necessary) in
> scikits/utils with some basic tests. That's really basic, though.
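Such NaN-aware scaling is easy to sketch with plain numpy; this is only
an illustration of the idea (standardizing columns while ignoring NaN
entries), not the actual scikits/utils code:

```python
import numpy as np

def scale(data):
    """Standardize columns to zero mean and unit variance,
    ignoring NaN entries when computing the statistics."""
    data = np.asarray(data, dtype=float)
    mean = np.nanmean(data, axis=0)
    std = np.nanstd(data, axis=0)
    std[std == 0] = 1.0  # avoid division by zero on constant columns
    return (data - mean) / std

x = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0]])
scaled = scale(x)  # NaN entries stay NaN, other columns are centered
```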
> If I am mistaken about the current state of affairs, please correct me.
> > Well, now about what I want to change.
> > I am going to make the learn package suitable for text classification.
> > Also I want to copy most of the functionality of PyML
> > (http://pyml.sourceforge.net/).
> > First of all we need a sparse data format. I want to write parsers for
> > a number of common data formats.
> What do you mean exactly by sparse data? At the implementation level,
> ideally, the algorithms should use scipy.sparse, I think. At the "high"
> level, something like Spider should be used:
I mean that I have 1 million text pages with 150 thousand different
words (features), but each document contains only a small fraction of
those 150 thousand words. If I use a plain dense matrix it will be huge,
but if I use a sparse format such as the libsvm data format, the input
file will be much smaller. I don't know how to do this with scikits now.
I know how to do it with libsvm and many other tools, but I want to make
scipy suitable for this task. And I want to write a tutorial in which one
paragraph will be about
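Concretely, the libsvm text format ("label index:value index:value ...",
with 1-based indices) maps directly onto scipy.sparse. A minimal parser
could look like this; the function name parse_libsvm is my own, nothing
like it exists in scikits yet:

```python
import numpy as np
from scipy import sparse

def parse_libsvm(lines, n_features):
    """Parse libsvm-format lines ("label idx:val idx:val ...",
    1-based indices) into a CSR matrix and a label array."""
    rows, cols, vals, labels = [], [], [], []
    for i, line in enumerate(lines):
        fields = line.split()
        labels.append(float(fields[0]))
        for item in fields[1:]:
            idx, val = item.split(":")
            rows.append(i)
            cols.append(int(idx) - 1)  # libsvm indices start at 1
            vals.append(float(val))
    x = sparse.csr_matrix((vals, (rows, cols)),
                          shape=(len(labels), n_features))
    return x, np.array(labels)

lines = ["1 3:0.5 7:1.0",
         "-1 1:2.0"]
x, y = parse_libsvm(lines, n_features=10)
```

Only the non-zero entries are stored, so the million-document case stays
manageable in memory.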
> Spider is quite good and has a good interface, and that with matlab,
> which is quite an achievement.
> Of course, it is your proposal, hence your choice, but I don't think
> focusing on many formats is the right thing at first. One thing I had
> in mind was to implement a "proxy" to communicate between the high-level
> representation in scikits.learn and other packages such as weka, orange,
> etc... This would give a practical way to use weka with some python
> tools, until we get something of our own for visualization. Spider does
> have something to communicate with weka, for example (it is easier in
> matlab, I guess, since matlab has a jvm and weka is in java).
I get it, but I don't like Java :) And I think a lot of weka's
functionality would not be very difficult to implement in scipy.