[SciPy-dev] Hierarchical clustering package
Tue Nov 20 05:24:17 CST 2007
Damian Eads wrote:
> I developed a hierarchical clustering package, which I offer under the
> terms of the BSD License. It supports agglomerative clustering, plotting
> of dendrograms, flat cluster formation, computation of a few cluster
> statistics, and computing distances between vectors using a variety of
> distance metrics. The source code is available for your perusal at
> http://scipy-cluster.googlecode.com/svn/trunk/ and the API
> Documentation, http://www.soe.ucsc.edu/~eads/cluster.html . The
> interface is similar to the interface used in MATLAB's Statistics
> Toolbox to ease conversion of old MATLAB programs to Python. Eventually,
> I'd like to integrate it into Scipy (hence, naming my SVN repository
This looks great. I have a couple of questions:
- do you think it would be possible to split out the reusable parts of
the package (in particular, the distance matrices: scipy.cluster and
a few other packages could reuse those)?
- do you have some examples?
I don't know what the opinion of others is on this, but maybe this
package could be added to scikits (there is already a scikits.learn
package for ML-related algorithms: ANN, EM for mixtures of Gaussians,
and SVM)?
> A few things:
> * matplotlib is optional: the only function that requires matplotlib
> support is dendrogram. However, I've abstracted the code so that
> ImportError exceptions are caught when importing matplotlib. This
> enables the package to be imported without any import errors. When an
> attempt is made to plot when matplotlib is unavailable, an exception is
> then thrown indicating that graphical rendering is not supported. The
> caller can pass a no_plot parameter to have the coordinates of the plot
> elements calculated without any rendering done in matplotlib.
I think this is one way of doing it (I am doing the same for my own
packages, at least).
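For readers of the archive, a minimal sketch of the optional-import
pattern described above might look like the following. The function
name plot_segments and the (x0, y0, x1, y1) segment format are made up
for illustration and are not the package's actual API; only the
no_plot parameter name comes from the quoted text.

```python
# Illustrative sketch only: plot_segments and the segment format are
# hypothetical, not taken from the clustering package itself.
try:
    import matplotlib.pyplot as plt
    _HAVE_MPL = True
except ImportError:
    # Importing this module must still succeed without matplotlib.
    plt = None
    _HAVE_MPL = False


def plot_segments(segments, no_plot=False):
    """Compute plot coordinates for line segments and optionally draw them.

    Each segment is a tuple (x0, y0, x1, y1). The coordinates are always
    computed; matplotlib is only touched when no_plot is False.
    """
    coords = [((x0, x1), (y0, y1)) for (x0, y0, x1, y1) in segments]
    if no_plot:
        # Caller only wants the coordinates, so no rendering is attempted.
        return coords
    if not _HAVE_MPL:
        # Fail at plot time, not at import time.
        raise ImportError(
            "matplotlib is not available; graphical rendering is not supported"
        )
    for xs, ys in coords:
        plt.plot(xs, ys)
    return coords
```

With this layout the error is deferred until someone actually asks for
a plot, and passing no_plot=True works whether or not matplotlib is
installed.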
> * The tests I'm writing require some data files to run. What is the
> convention for storing and retrieving data files when running a Scipy
> regression test? Presumably the test programs should be able to find the
> data files without regard to whether the data files are stored in
> /usr/share or in the src directory. One solution is to embed the data in
> the testing programs themselves but this is messy, and I'd like to know
> if there is a better solution.
The convention is to have the datasets in the package. I am not sure I
understand why it is messy: it is good to have self-contained regression
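One common way to keep data files inside the package and still find
them regardless of where the package is installed is to resolve paths
relative to the test module itself. This is only a sketch under
assumptions: the "data" subdirectory, the data_path helper, and the
file name in the usage line are hypothetical, not an established Scipy
convention.

```python
import os


def data_path(filename, base=None):
    """Return the path of a data file stored next to the test module.

    Assumes (hypothetically) that data files live in a "data" directory
    beside the module defining this function. `base` overrides the
    module directory, which is convenient for checking the helper.
    """
    if base is None:
        # Directory containing this module, wherever it was installed.
        base = os.path.dirname(os.path.abspath(__file__))
    return os.path.join(base, "data", filename)
```

A test would then open data_path("iris.txt") (a made-up file name)
without caring whether the package sits in the source tree or in
site-packages.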