[SciPy-dev] Google Summer of Code and scipy.learn (another trying)
Wed Mar 19 22:57:39 CDT 2008
Anton Slesarev wrote:
> I'm going to describe what problems I see in the current version of
> scikits.learn. After that, I'll write what I want to improve during
> Google Summer of Code. In my last letter I tried to enumerate some
> limitations in other open-source frameworks such as PyML and Orange.
Sorry for the late answer, but I have been busy the last few days at
a conference. Most of your points are valid: there are some useful bits
in there now, but they are disparate. The main reason is that I started
working on a new build system for numpy/scipy, which took all my free
time. I intend to work on scikits.learn again soon, though.
Here are some comments:
> Let's start about Scikits.learn.
> First of all, there is a lack of documentation. I can find nothing
> besides David Cournapeau's proposal on Google Summer of Code: nothing
> in the wiki and nothing in the mailing list. There are a few examples
> in svm, of course, but it is very hard to work from examples alone.
scikits.machine.em and scikits.machine.svm are usable, but they do not
share a common usage pattern. They do have docs, though (in particular,
scikits.machine.em has a 15-page PDF tutorial). Most examples do not
work, but that is only because of some changes in the dataset format
(see below), and that should be easy to fix.
I think examples are really important. The problem with docs is the
format and the tools used to generate them; other people will know
better on this particular point. I know I don't like the current tools
myself (epydoc, etc.), but that's just my opinion.
> I can't find parsers for different data formats, only for datasets.
> As I understand it, datasets don't support a sparse data format.
On datasets, there is something:
The basic idea is to NOT care about format, but just to provide basic
conventions, so that any tool which needs a dataset can pick what it
wants through introspection. In particular, I do not see any problem
wrt sparse data, but I have not thought a lot about it, so maybe I
missed something. So there is no format, and no parsing needed :) I
think it is impossible to agree on something which is useful and usable
by everybody (the proposal does not focus exclusively on machine
learning), hence the lack of a format specification.
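To make the "conventions, not format" idea concrete, here is a minimal
sketch of what I have in mind. The class and function names are
hypothetical, not anything already in scikits.learn: a dataset is just
an object exposing agreed-upon attributes, and a consumer discovers
what it needs through introspection rather than by parsing a file
format.

```python
import numpy as np

class SimpleDataset:
    """Hypothetical dataset following a convention: a 'data' attribute
    holding samples as rows, and an optional 'labels' attribute."""
    def __init__(self, data, labels=None):
        self.data = np.asarray(data)
        self.labels = labels

def n_samples(dataset):
    """Any tool can inspect an object for the conventions it needs,
    instead of requiring one blessed dataset class or file format."""
    if hasattr(dataset, "data"):
        return dataset.data.shape[0]
    raise TypeError("object does not follow the dataset convention")

ds = SimpleDataset([[1, 2], [3, 4], [5, 6]])
print(n_samples(ds))  # each row is one sample
```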
I worked on a parser for arff (the weka format), which could read most
arff files I could throw at it. But recently, someone wrote a grammar
for the format. Unfortunately, I do not know much about parsing and co,
but I think anybody with some computer science background could get a
python module to parse files following this grammar in no time.
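For a flavour of what such a parser deals with, here is a heavily
simplified, hypothetical sketch: it handles only numeric @attribute
declarations and comma-separated @data rows, nothing close to the full
arff grammar mentioned above.

```python
def parse_arff(text):
    """Toy parser for a simplified arff subset: returns the attribute
    names and the data rows as lists of floats."""
    attributes, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):   # skip blanks and comments
            continue
        low = line.lower()
        if low.startswith("@attribute"):
            attributes.append(line.split()[1])
        elif low.startswith("@data"):
            in_data = True
        elif in_data:
            rows.append([float(v) for v in line.split(",")])
    return attributes, rows
```

A real implementation would also need nominal attributes, quoted
strings, and missing values, which is exactly where a proper grammar
pays off.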
> There is no common structure in the ML package. It has scattered
> modules such as svm, em, and ann, but no main idea.
Here is how I see things: different people have different usages. Some
people like to try many different learning algorithms (and need common
structure, general tools, etc.), while others just want to focus on one
algorithm. It is important to keep the different "learning" algorithms
independent: scikits.machine.em should be usable on its own, and the
same goes for the other scikits.machine modules; that is, there should
be one level at which you can just use the algorithms with straight
numpy arrays (no dataset class or anything). Ideally, I want them to
have a 100% Python implementation, so that they can be used for
educational purposes. Of course, a C/whatever implementation is
possible too, but that should be a complement.
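As an illustration of that level (and not the actual scikits.machine
API), here is a sketch of a learning algorithm, a nearest-mean
classifier, written as plain functions over raw NumPy arrays in pure
Python, simple enough to read for teaching:

```python
import numpy as np

def nearest_mean_fit(X, y):
    """Compute one mean vector per class from plain arrays:
    X is (n_samples, n_features), y is (n_samples,)."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, means

def nearest_mean_predict(X, classes, means):
    """Assign each sample to the class whose mean is closest."""
    dist = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return classes[dist.argmin(axis=1)]
```

No dataset object is involved; a higher-level layer with common
structure could wrap functions like these without getting in their way.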
So the main idea is to have pre-processing tools, and other things
common to most algorithms on one side, and the actual learning
algorithms on another side.
There is already some code to scale data (handling NaN if necessary) in
scikits/utils, with some basic tests. It is really basic, though.
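In the same spirit (this is a minimal sketch, not the scikits/utils
code itself), NaN-aware column scaling can be written in a few lines of
NumPy:

```python
import numpy as np

def scale(X):
    """Center each column to zero mean and unit standard deviation,
    ignoring NaN entries when computing the statistics."""
    mean = np.nanmean(X, axis=0)
    std = np.nanstd(X, axis=0)
    std[std == 0] = 1.0   # leave constant columns untouched
    return (X - mean) / std
```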
> If I am mistaken about the current state of affairs, please correct me.
> Well, now about what I want to change.
> I am going to make the learn package suitable for text classification.
> Also, I want to replicate most of PyML's (pyml.sourceforge.net/
> <http://pyml.sourceforge.net/>) functionality.
> First of all, we need a sparse data format. I want to write parsers
> for a number of common data formats.
What do you mean exactly by sparse data? At the implementation level,
the algorithms should ideally use scipy.sparse, I think. At the "high"
level, something like spider could be used: spider is quite good and
has a good interface, and this in matlab, which is quite an
achievement.
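As a minimal sketch of the implementation-level suggestion: sparse data
such as word counts for text classification can be held in a
scipy.sparse matrix directly, with no new on-disk format needed (the
row/column/value arrays below are made-up example data):

```python
import numpy as np
from scipy import sparse

rows = np.array([0, 0, 1, 2])   # document index
cols = np.array([1, 3, 0, 3])   # word index
vals = np.array([2, 1, 5, 3])   # word count
# 3 documents x 4 words, with only 4 nonzero entries stored
X = sparse.csr_matrix((vals, (rows, cols)), shape=(3, 4))
```

CSR gives fast row slicing, which suits per-document access; algorithms
built on scipy.sparse then work on any sparse input regardless of how
it was parsed.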
Of course, it is your proposal, hence your choice, but I don't think
focusing on many formats is the right thing at first. One thing I had
in mind was to implement a "proxy" to communicate between the
high-level representation in scikits.learn and other packages such as
weka, orange, etc. This would give a practical way to use weka with
some python tools until we get something of our own for visualization.
Spider does have something to communicate with weka, for example (I
guess it is easier in matlab since matlab has a jvm and weka is written
in java).