[SciPy-dev] Hierarchical clustering package

David Cournapeau david@ar.media.kyoto-u.ac...
Tue Nov 20 05:24:17 CST 2007


Damian Eads wrote:
> Hello,
>
> I developed a hierarchical clustering package, which I offer under the 
> terms of the BSD License. It supports agglomerative clustering, plotting 
> of dendrograms, flat cluster formation, computation of a few cluster 
> statistics, and computing distances between vectors using a variety of 
> distance metrics. The source code is available for your perusal at 
> http://scipy-cluster.googlecode.com/svn/trunk/ and the API 
> Documentation, http://www.soe.ucsc.edu/~eads/cluster.html . The 
> interface is similar to the interface used in MATLAB's Statistics 
> Toolbox to ease conversion of old MATLAB programs to Python. Eventually, 
> I'd like to integrate it into Scipy (hence, naming my SVN repository 
> scipy-cluster).
Hi Damian,

    This looks great. I have a couple of questions:

    - do you think it would be possible to split out the reusable parts 
of the package (in particular, the distance matrices: scipy.cluster and 
a few other packages could reuse those)?
    - do you have some examples?

I don't know what others' opinions are on this, but maybe this 
package could be added to scikits (there is already a scikits.learn 
package for ML-related algorithms: ANN, EM for mixtures of Gaussians, 
and SVM)?

>
> A few things:
>
>    * matplotlib is optional: the only function that requires matplotlib 
> support is dendrogram. However, I've abstracted the code so that 
> ImportError exceptions are caught when importing matplotlib. This 
> enables the package to be imported without any import errors. When an 
> attempt is made to plot when matplotlib is unavailable, an exception is 
> then thrown indicating that graphical rendering is not supported. The 
> caller can pass a no_plot parameter to have the coordinates of the plot 
> elements calculated without any rendering done in matplotlib.
I think this is one way of doing it (I am doing the same for my own 
packages, at least).
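
For reference, a minimal sketch of the optional-import pattern described 
above. The names HAVE_MATPLOTLIB and plot_segments are hypothetical and 
are not taken from the package; only the no_plot parameter name comes from 
the description, and the coordinate layout is an assumption made for 
illustration:

```python
# Hypothetical sketch: these names are illustrative, not the package's API.
try:
    import matplotlib.pyplot as plt
    HAVE_MATPLOTLIB = True
except ImportError:
    # The module still imports cleanly; only plotting is disabled.
    plt = None
    HAVE_MATPLOTLIB = False


def plot_segments(segments, no_plot=False):
    """Compute plot coordinates and draw them unless no_plot is set.

    segments is a list of ((x0, y0), (x1, y1)) pairs.
    """
    xs = [[p[0], q[0]] for p, q in segments]
    ys = [[p[1], q[1]] for p, q in segments]
    coords = {"xs": xs, "ys": ys}
    if no_plot:
        # The caller only wants the coordinates; matplotlib is never touched.
        return coords
    if not HAVE_MATPLOTLIB:
        raise RuntimeError(
            "graphical rendering is not supported: matplotlib is unavailable"
        )
    for x, y in zip(xs, ys):
        plt.plot(x, y)
    return coords
```

With no_plot=True the function returns the coordinates whether or not 
matplotlib is installed, so the computation can be tested without a 
graphical backend.
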
>
>    * The tests I'm writing require some data files to run. What is the 
> convention for storing and retrieving data files when running a Scipy 
> regression test? Presumably the test programs should be able to find the 
> data files without regard to whether the data files are stored in 
> /usr/share or in the src directory. One solution is to embed the data in 
> the testing programs themselves but this is messy, and I'd like to know 
> if there is a better solution.
The convention is to have the datasets in the package. I am not sure I 
understand why it is messy: isn't it good to have self-contained 
regression tests?
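
One possible way to find such files wherever the package ends up 
installed is to resolve them relative to the test module itself. This is 
only a sketch: the helper name data_path, the "data" subdirectory and the 
file name in the comment are assumptions, not an established Scipy 
convention:

```python
import os


def data_path(test_file, name):
    """Return the path of data file `name` in a data/ directory
    located next to the module whose path is `test_file`."""
    here = os.path.dirname(os.path.abspath(test_file))
    return os.path.join(here, "data", name)


# In a test module one would typically call it with __file__, e.g.:
#     path = data_path(__file__, "some_dataset.txt")
```

Because the path is derived from the location of the test module, the 
same code works from the source tree and from an installed copy.
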

David

