[SciPy-dev] Google Summer of Code and scipy.learn (another trying)

David Cournapeau david@ar.media.kyoto-u.ac...
Wed Mar 19 22:57:39 CDT 2008


Anton Slesarev wrote:
> Hi.
>
> I'm going to describe what problems I see in current version of 
> scikits.learn. After that I'll write what I want to improve during 
> Google Summer of Code. In my last letter I tried to numerate some 
> limitations in other open-source frameworks such as PyML and Orange.
>

Hi Anton,

    Sorry for the late answer; I have been busy at a conference for the 
last few days. Most of your points are valid: there are some useful bits 
in there now, but they are disparate. The main reason is that I started 
working on a new build system for numpy/scipy, which took all my free 
time. I intend to work on scikits.learn again soon, though.

Here are some comments:

> Let's start about Scikits.learn.
>
> First of all is a lack of documentation. I can find nothing beside 
> David Cournapeau proposal on google Summer of Code. Nothing in wiki 
> and nothing in maillist. There are few examples in svm, of course. But 
> it is very hard use only examples. 

scikits.machine.em and scikits.machine.svm are usable, but they do not 
share a common usage pattern. They do have docs, though (particularly 
scikits.machine.em, which has a 15-page pdf tutorial). Most examples do 
not work, but that's only because of some changes in the dataset format 
(see below), and that should be easy to fix.

I think examples are really important. The problem with docs is the 
format and the tools to generate them; other people know more about this 
particular point. I know I don't like the current tools myself (epydoc, 
etc.), but that's just my opinion.

> I can't find parser of different data formats. Only for datasets. As I 
> understand datasets don't support sparse data format.

On datasets, there is a proposal:

http://projects.scipy.org/scipy/scikits/browser/trunk/learn/scikits/learn/datasets/DATASET_PROPOSAL.txt

The basic idea is to NOT care about format. The proposal just provides 
basic conventions, such that any tool which needs a dataset can pick out 
what it wants through introspection. In particular, I do not see any 
problem wrt sparse data, but I have not thought a lot about it, so maybe 
I missed something.
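
To make the convention idea concrete, here is a minimal sketch. The key 
names ("data", "label", "class") and the helper describe_dataset are 
assumptions made up for illustration, not the exact conventions of the 
proposal; the point is only that a tool inspects what is there instead 
of requiring a fixed format.

```python
# Hypothetical convention: a dataset is a plain dict with optional keys.
# The key names are assumptions for illustration, not the proposal's own.

def make_toy_dataset():
    """Return a tiny dataset following the hypothetical convention."""
    return {
        "data": [[5.1, 3.5], [4.9, 3.0], [6.2, 2.9]],  # one row per sample
        "label": [0, 0, 1],                             # one label per sample
        "class": {"setosa": 0, "versicolor": 1},        # class name -> code
    }


def describe_dataset(dataset):
    """Discover by introspection what a dataset offers."""
    info = {"n_samples": len(dataset["data"])}
    info["supervised"] = "label" in dataset
    if "class" in dataset:
        info["class_names"] = sorted(dataset["class"])
    return info


if __name__ == "__main__":
    print(describe_dataset(make_toy_dataset()))
```

A tool that only needs the raw data never looks at "label" or "class", 
so an unlabeled dataset works with it unchanged.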

So there is no format, and no need for parsing :) I think it is 
impossible to agree on a format which is useful and usable by everybody 
(the proposal does not focus exclusively on machine learning), hence the 
lack of format specification.

I worked on a parser for arff (weka format), which could read most arff 
files I could throw at it. But recently, someone wrote a grammar for the 
arff format:

http://permalink.gmane.org/gmane.comp.ai.weka/11742

Unfortunately, I do not know anything about parsing and co, but I guess 
anybody who studied computer science in some way would know how to write 
a python module to parse files following this grammar in no time.
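
For a rough idea of the size of the task, here is a minimal sketch of a 
reader for a small subset of arff. It is not derived from the grammar 
linked above and is not the parser I mentioned: the function name 
parse_simple_arff is made up, and it assumes unquoted attribute names, 
dense comma-separated rows, and no quoted or escaped values.

```python
def parse_simple_arff(text):
    """Parse a small arff subset: @relation, @attribute, dense @data rows."""
    relation = None
    attributes = []   # (name, type) pairs, in declaration order
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):    # skip blanks and comments
            continue
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = []
            for (name, kind), value in zip(attributes, values):
                if value == "?":                 # arff marker for missing
                    row.append(None)
                elif kind in ("numeric", "real", "integer"):
                    row.append(float(value))
                else:
                    row.append(value)            # nominal: keep the string
            rows.append(row)
            continue
        lower = line.lower()
        if lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind.strip().lower()))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


SAMPLE = """% toy file
@relation weather
@attribute temperature numeric
@attribute play {yes,no}
@data
21.5,yes
?,no
"""

if __name__ == "__main__":
    print(parse_simple_arff(SAMPLE))
```

Anything beyond this subset (quoted names, sparse rows, dates) is where 
a parser built from a real grammar would pay off.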

> There is no common structure in ML package. It has scattered modules 
> such as svm, em, ann, but no main idea.

Here is how I see things: different people have different usages. Some 
people like to try many different learning algorithms (they need a 
common structure, general tools, etc.), some just want to focus on one 
algorithm. It is important to keep the different "learning" algorithms 
independent: scikits.machine.em should be usable on its own, same for 
the other scikits.machine modules; that is, there should be one level at 
which you can just use the algorithms with straight numpy arrays (no 
dataset class or anything). Ideally, I want them to have a 100 % python 
implementation, so that they can be used for educational purposes. Of 
course, C/whatever implementations are possible too, but they should be 
a complement.

So the main idea is to have pre-processing tools, and other things 
common to most algorithms on one side, and the actual learning 
algorithms on another side.

There is already some code to scale data (handling nan if necessary) in 
scikits/utils with some basic tests. That's really basic, though.
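
To show the kind of thing I mean by scaling with nan handling, here is a 
minimal sketch. It is not the code in scikits/utils; the function name 
scale_to_unit_range and the choice of scaling each column to [0, 1] are 
assumptions for illustration.

```python
import numpy as np


def scale_to_unit_range(data):
    """Scale each column to [0, 1], ignoring nan when taking min and max."""
    data = np.asarray(data, dtype=float)
    col_min = np.nanmin(data, axis=0)   # nan entries are skipped
    col_max = np.nanmax(data, axis=0)
    span = col_max - col_min
    span[span == 0] = 1.0               # constant column: avoid dividing by 0
    return (data - col_min) / span      # nan entries stay nan


if __name__ == "__main__":
    x = [[1.0, 10.0], [np.nan, 20.0], [3.0, 30.0]]
    print(scale_to_unit_range(x))
```

Missing values are left as nan in the output, so the caller still 
decides how to deal with them.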

>
> If I mistake in understanding current state of affair you can correct me.
>
> Well, now about what I want to change.
>
> I am going to make learn package appropriate for text classification. 
> Also I want to copy most of PyML (pyml.sourceforge.net/ 
> <http://pyml.sourceforge.net/>) functionality.
>
> First of all we need sparse data format. I want to write parsers for a 
> number of common data formats.

What do you mean exactly by sparse data? At the implementation level, 
ideally, the algorithms should use scipy.sparse, I think. At the "high" 
level, something like spider should be used:

http://www.kyb.tuebingen.mpg.de/bs/people/spider/main.html

Spider is quite good and has a good interface, and this in matlab, which 
is quite an achievement.
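
On the implementation-level point, here is a minimal sketch of what 
using scipy.sparse could look like for text data. The function name 
bag_of_words and the whitespace tokenization are assumptions for 
illustration; nothing like this exists in scikits.learn as far as this 
mail is concerned.

```python
import numpy as np
from scipy.sparse import csr_matrix


def bag_of_words(documents):
    """Build a sparse document-term count matrix from whitespace-split text."""
    vocabulary = {}              # word -> column index
    rows, cols, counts = [], [], []
    for row, doc in enumerate(documents):
        for word in doc.lower().split():
            col = vocabulary.setdefault(word, len(vocabulary))
            rows.append(row)
            cols.append(col)
            counts.append(1)
    # Repeated (row, col) pairs are summed when the matrix is built,
    # which turns the list of ones into per-document word counts.
    shape = (len(documents), len(vocabulary))
    matrix = csr_matrix((counts, (rows, cols)), shape=shape, dtype=np.int64)
    return matrix, vocabulary


if __name__ == "__main__":
    m, vocab = bag_of_words(["the cat sat", "the cat the dog"])
    print(vocab)
    print(m.toarray())
```

Only the non-zero counts are stored, which is what matters for text 
classification, where each document uses a tiny part of the vocabulary.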

Of course, it is your proposal, hence your choice, but I don't think 
focusing on many formats is the right thing at first. One thing I had in 
mind was to implement a "proxy" to communicate between the high level 
representation in scikits.learn and other packages such as weka, orange, 
etc. This would give a practical way to use weka with some python tools, 
until we get something of our own for visualization. Spider does have 
something to communicate with weka, for example (it is easier in matlab 
I guess, since matlab has a jvm and weka is in java).

cheers,

David

