[SciPy-user] scipy data mining ?

Karl Young Karl.Young at ucsf.edu
Wed Jan 24 11:48:16 CST 2007

David Cournapeau wrote:

>Ivan Vilata i Balaguer wrote:
>>Karl Young (on 2007-01-22 at 14:16:56 -0800) said::
>>>I'm currently using a nice Java based data mining package called Weka 
>>>(essentially as a black box as I don't have time to learn Java)  but was 
>>>looking for something more python/scipy friendly to switch to as I'd 
>>>prefer more interactive use. I found a python package on the web that 
>>>potentially looks pretty nice (Orange - http://www.ailab.si/orange) but 
>>>given that it uses GPL (and also given the recent discussion on license 
>>>issues) and doesn't look to have made any effort to be numpy/scipy 
>>>friendly I was wondering if anyone was aware of a more scipy friendly 
>>>effort. Should someone (maybe even me...) be talked into contacting the 
>>>Orange developers and seeing if they'd be interested in a switch to BSD 
>>>and a gradual evolution towards integration with numpy... ?
>>You may also give a try to PyTables_, which is already being used by
>>some people to perform data mining.  It is not similar to Orange or Weka
>>in the sense that PyTables is a lower-level, non-GUI Python library.
>>However, it uses NumPy at its core, so integration with SciPy should be
>>no problem, and it is designed to be comfortable in interactive usage
>>(on a Python console).  The standard version is free/libre software
>>under a BSD license.
>>On the GUI side, you could use ViTables_ for textual browsing of big
>>files, or HDFView_ if you need plotting or image visualisation.
>>.. _PyTables: http://www.pytables.org/
>>.. _ViTables: http://www.carabos.com/products/vitables
>>.. _HDFView: http://www.hdfgroup.org/hdf-java-html/hdfview/
>That would give the IO and interface part of Orange, but not the core 
>machine learning part. This is, I think, one area where numpy/scipy is 
>still lacking, at least integration-wise, compared to MATLAB, which has 
>major toolboxes such as Netlab for this kind of thing.
Thanks for the suggestions but, yes, I feel that the set of core 
machine learning algorithms is the important piece. There are some 
implementations of specific algorithms around, but it seems like a lot of 
work to include a significant fraction of recently developed machine 
learning algorithms in a single package with a consistent interface (as 
the Weka guys have done). Since the (exploratory) tendency in data 
mining is not to trust any single algorithm for all data analysis, it's 
important to have a range available. Given that the Orange developers 
have already started something like that, it seemed reasonable to explore 
some kind of integration. (I could just use it anyway and hack whatever I 
need to use it with numpy array utilities, but some kind of moderated 
integration seemed more consistent with the scipy philosophy.)
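
To make "a range of algorithms behind a consistent interface" a bit more
concrete, here is a minimal sketch using only numpy. Everything in it is an
assumption made for illustration: the class names, the fit/predict
convention, and the compare() helper are hypothetical and are not taken from
Weka, Orange, or scipy. The two classifiers are deliberately trivial; the
point is only that they can be swapped freely because they share one
interface and operate directly on numpy arrays.

```python
import numpy as np


class NearestCentroid:
    """Hypothetical classifier: pick the class whose mean is closest."""

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        # One centroid (mean vector) per class label.
        self.centroids_ = np.array(
            [X[y == c].mean(axis=0) for c in self.classes_]
        )
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Pairwise distances, shape (n_samples, n_classes), via broadcasting.
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[d.argmin(axis=1)]


class OneNearestNeighbour:
    """Hypothetical classifier: copy the label of the closest training point."""

    def fit(self, X, y):
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Pairwise distances, shape (n_samples, n_training_points).
        d = np.linalg.norm(X[:, None, :] - self.X_[None, :, :], axis=2)
        return self.y_[d.argmin(axis=1)]


def compare(models, X_train, y_train, X_test, y_test):
    """Fit every model on the same data and return {name: accuracy}."""
    scores = {}
    for name, model in models.items():
        pred = model.fit(X_train, y_train).predict(X_test)
        scores[name] = float(np.mean(pred == np.asarray(y_test)))
    return scores


# Toy data, made up for the example: two well-separated clusters.
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0.2, 0.3], [4.8, 5.5]])
y_test = np.array([0, 1])

models = {"centroid": NearestCentroid(), "1nn": OneNearestNeighbour()}
scores = compare(models, X_train, y_train, X_test, y_test)
print(scores)
```

Because compare() only relies on fit() and predict(), trying another
algorithm in an interactive session amounts to adding one more entry to the
models dictionary, which is the kind of exploratory use I have in mind.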

-- KY

Karl Young
Center for Imaging of Neurodegenerative Diseases, UCSF          
VA Medical Center (114M)              Phone:  (415) 221-4810 x3114  lab        
4150 Clement Street                   FAX:    (415) 668-2864
San Francisco, CA 94121               Email:  karl young at ucsf edu
