[SciPy-dev] Hierarchical clustering package

Damian Eads eads@soe.ucsc....
Mon Nov 19 22:14:34 CST 2007


I developed a hierarchical clustering package, which I offer under the 
terms of the BSD License. It supports agglomerative clustering, plotting 
of dendrograms, flat cluster formation, computation of a few cluster 
statistics, and computing distances between vectors using a variety of 
distance metrics. The source code is available for your perusal at 
http://scipy-cluster.googlecode.com/svn/trunk/ and the API 
documentation is at http://www.soe.ucsc.edu/~eads/cluster.html . The 
interface is similar to the interface used in MATLAB's Statistics 
Toolbox to ease conversion of old MATLAB programs to Python. Eventually, 
I'd like to integrate it into Scipy (hence, naming my SVN repository 
scipy-cluster).

A few things:

   * matplotlib is optional: the only function that requires matplotlib 
support is dendrogram. I've wrapped the matplotlib import so that any 
ImportError is caught, which lets the package be imported even when 
matplotlib is missing. If a plot is attempted while matplotlib is 
unavailable, an exception is raised indicating that graphical rendering 
is not supported. The caller can pass a no_plot parameter to have the 
coordinates of the plot elements calculated without any rendering done 
in matplotlib.

   * Most of the algorithms are written in C. Half-way through the 
development of the package, I mistakenly indexed arrays using the 
dimensions field of the PyArrayObject rather than the strides field. On 
occasion, this causes erroneous behavior when the array passed is a view 
onto a base array. As a work-around until I get around to rewriting the 
code to use proper striding, if an array's base is non-null, it is 
copied prior to being passed to a C function. I should also note that 
changing the code to use proper striding might be difficult since, in 
many places, I assume, for the sake of efficiency and readability, that 
array elements are stored side-by-side in the array's underlying buffer.

   * When this project started, I used dtype='int32' when declaring 
integer Numpy arrays, and then referred to them with an int* pointer in 
C. I realize this might pose a problem on different architectures, 
64-bit for example. When declaring a numpy array with dtype='int', does 
Numpy use the host's compiler's sizeof(int) to determine the size of the 
ints to allocate for the array? If so, I might as well change all the 
instances of dtype='int32' to dtype='int' in my code. I can't really 
justify why I originally did this, but I know the code as it stands 
works on my 32-bit Intel machine with gcc.

   * I wrote the API documentation without any use of markup (epydoc or 
the like). This is so that help(function) still provides human-readable 
documentation. I noticed some discussion about reformatting Scipy's 
docstrings to use the epydoc markup. A few weeks ago, I tried 
contacting the author of epydoc to ask whether epydoc supports rendering 
markup into human-readable ASCII text, and whether there were any plans 
to enable such renderings when invoking Python's help command. He has 
yet to respond. I'm reluctant to use epydoc until I have assurance that 
console-based help with human-readable text rendering is supported. 
Thoughts on this?

   * The tests I'm writing require some data files to run. What is the 
convention for storing and retrieving data files when running a Scipy 
regression test? Presumably the test programs should be able to find the 
data files without regard to whether the data files are stored in 
/usr/share or in the src directory. One solution is to embed the data in 
the testing programs themselves but this is messy, and I'd like to know 
if there is a better solution.

   * The vector quantization/k-means Scipy package is already called 
cluster so there is a naming conflict. If the hierarchical clustering 
package is integrated into Scipy, I could rename it "agglom" or 
"hierarchical", and have it sit in the cluster package directory.
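
To illustrate the matplotlib point above, here is a minimal sketch of
the optional-import arrangement. The names HAVE_MPL and plot_segments
are hypothetical and are not the package's actual API; only the no_plot
parameter name comes from the description above.

```python
# Hypothetical sketch of an optional matplotlib dependency.
# HAVE_MPL and plot_segments are illustrative names only.
try:
    import matplotlib.pyplot as plt
    HAVE_MPL = True
except ImportError:
    # The package still imports cleanly; only plotting is disabled.
    plt = None
    HAVE_MPL = False


def plot_segments(segments, no_plot=False):
    """Compute, and optionally draw, a list of line segments.

    segments is a list of ((x0, y0), (x1, y1)) pairs. The plot
    coordinates are always returned, whether or not anything is drawn.
    """
    coords = [([x0, x1], [y0, y1]) for (x0, y0), (x1, y1) in segments]
    if no_plot:
        # Caller only wants the coordinates; matplotlib is never touched.
        return coords
    if not HAVE_MPL:
        raise ImportError("graphical rendering is not supported: "
                          "matplotlib could not be imported")
    for xs, ys in coords:
        plt.plot(xs, ys)
    return coords


coords = plot_segments([((0, 0), (1, 2))], no_plot=True)
```

With no_plot=True the call succeeds whether or not matplotlib is
installed, which is the behavior described for dendrogram.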

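To illustrate the striding point above, here is a minimal sketch of the
copy-before-C work-around, seen from the Python side. The helper name
as_c_ready is hypothetical; the base, strides and flags attributes are
standard NumPy.

```python
import numpy as np


def as_c_ready(a):
    """Hypothetical helper: copy any array that is a view on a base array.

    C code that walks the buffer using only the dimensions assumes the
    elements sit side-by-side, which a view does not guarantee.
    """
    if a.base is not None:
        a = a.copy()  # the copy owns a fresh, contiguous buffer
    return a


X = np.array([[0.0, 1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0, 7.0],
              [8.0, 9.0, 10.0, 11.0]])
view = X[:, ::2]          # every other column; shares X's buffer
safe = as_c_ready(view)

print(X.strides, view.strides, safe.strides)
print(view.flags['C_CONTIGUOUS'], safe.flags['C_CONTIGUOUS'])
```

The view reports a different byte step between columns than a packed
array of the same shape would have, which is exactly what code indexing
by dimensions alone gets wrong. np.ascontiguousarray is an alternative
that copies only when the layout actually requires it.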

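To illustrate the integer-size question above, here is a small check of
the item sizes involved. As far as I know, dtype='int' does not follow
the compiler's sizeof(int): it has historically been the C long, and
its width depends on the platform and NumPy version. The dtype that
tracks the C int is numpy.intc. The variable names are only
illustrative.

```python
import ctypes

import numpy as np

# Size of the C "int" as seen by the C library Python was built against.
c_int_size = ctypes.sizeof(ctypes.c_int)

int32_size = np.dtype('int32').itemsize    # fixed width: always 4 bytes
intc_size = np.dtype(np.intc).itemsize     # defined to match the C int
default_size = np.dtype('int').itemsize    # platform-dependent default

print("C int      :", c_int_size)
print("dtype int32:", int32_size)
print("dtype intc :", intc_size)
print("dtype int  :", default_size)
```

On that reading, an int* pointer in C is matched by numpy.intc, while
dtype='int32' stays correct wherever the C int is 32 bits wide.
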
Damian Eads
