[SciPy-dev] Hierarchical clustering package
Damian Eads
eads@soe.ucsc....
Mon Nov 19 22:14:34 CST 2007
Hello,
I developed a hierarchical clustering package, which I offer under the
terms of the BSD License. It supports agglomerative clustering, plotting
of dendrograms, flat cluster formation, computation of a few cluster
statistics, and computing distances between vectors using a variety of
distance metrics. The source code is available for your perusal at
http://scipy-cluster.googlecode.com/svn/trunk/ and the API
documentation is at http://www.soe.ucsc.edu/~eads/cluster.html . The
interface is similar to the interface used in MATLAB's Statistics
Toolbox to ease conversion of old MATLAB programs to Python. Eventually,
I'd like to integrate it into Scipy (hence the name of my SVN
repository, scipy-cluster).
A few things:
* matplotlib is optional: the only function that requires matplotlib
support is dendrogram. However, I've abstracted the code so that
ImportError exceptions are caught when importing matplotlib. This
enables the package to be imported without any import errors. When an
attempt is made to plot when matplotlib is unavailable, an exception is
then thrown indicating that graphical rendering is not supported. The
caller can pass a no_plot parameter to have the coordinates of the plot
elements calculated without any rendering done in matplotlib.
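
A minimal sketch of the optional-import pattern described above. The
names plot_segments and HAVE_MPL are hypothetical and chosen only for
illustration; this is not the package's actual code, and only the
no_plot parameter name comes from the description above.

```python
# Sketch only: plot_segments and HAVE_MPL are hypothetical names.
try:
    import matplotlib.pyplot as plt
    HAVE_MPL = True
except ImportError:
    # The module still imports cleanly when matplotlib is missing.
    plt = None
    HAVE_MPL = False


def plot_segments(segments, no_plot=False):
    """Return the line segments, drawing them unless no_plot is set.

    segments is a list of ((x0, y0), (x1, y1)) pairs computed by the caller.
    """
    if no_plot:
        # Coordinates only; matplotlib is never touched on this path.
        return segments
    if not HAVE_MPL:
        raise RuntimeError(
            "graphical rendering is not supported: matplotlib is unavailable"
        )
    for (x0, y0), (x1, y1) in segments:
        plt.plot([x0, x1], [y0, y1], "b-")
    return segments
```

With no_plot=True the function works whether or not matplotlib is
installed; the error is raised only when rendering is actually requested.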
* Most of the algorithms are written in C. Half-way through the
development of the package, I mistakenly computed array offsets from the
dimensions field of the PyArrayObject rather than the strides field. On
occasion, this causes erroneous behavior when the array passed is a view
of a base array. As a work-around until I get around to rewriting the
code to use proper striding, if an array's base is non-null, it is
copied prior to being passed to a C function. I should also note that
changing the code to use proper striding might be difficult since, in
many places in the code, I assume for the sake of efficiency and
readability that array elements are stored side-by-side in the array's
underlying buffer.
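
To show why the base check matters, here is a small sketch of the
work-around on the Python side, assuming NumPy is installed. The helper
name as_c_ready is hypothetical, and the itemsize figures in the comments
assume 8-byte doubles.

```python
import numpy as np


def as_c_ready(a):
    """Copy the array if it is a view onto another array (hypothetical helper)."""
    return a.copy() if a.base is not None else a


# An array that owns its data: base is None and elements are contiguous.
full = np.array([[0.0, 1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0, 7.0],
                 [8.0, 9.0, 10.0, 11.0]])

# Every other column: a view into full, so its elements are NOT adjacent.
view = full[:, ::2]
print(view.base is full)   # the view refers to a base array
print(view.strides)        # (32, 16): steps through full's buffer
print(view.shape)          # (3, 2): dimensions alone would imply (16, 8)

# The copy owns a fresh contiguous buffer that C code can walk using
# the dimensions alone.
safe = as_c_ready(view)
print(safe.strides)        # (16, 8)
```

Code that derives offsets from the shape would read the wrong elements
from view but the right ones from safe, which is why copying hides the
bug.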
* When this project started, I used dtype='int32' when declaring
integer Numpy arrays, and then referred to them with an int* pointer in
C. I realize this might pose a problem on different architectures,
64-bit for example. When declaring a numpy array with dtype='int', does
Numpy use the host's compiler's sizeof(int) to determine the size of the
ints to allocate for the array? If so, I might as well change all the
instances of dtype='int32' to dtype='int' in my code. I can't really
justify why I originally did this, but I know the code as it stands
works on my 32-bit Intel machine with gcc.
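
One way to check this on a given machine is to compare item sizes
directly. The sketch below assumes NumPy and the standard ctypes module
are available. It relies on numpy.intc, which is documented as the dtype
matching the platform's C int; what dtype='int' resolves to is reported
by the script rather than assumed.

```python
import ctypes

import numpy as np

# Sizes in bytes, as seen by the C compiler that built Python (via ctypes)
# and by NumPy for a few candidate dtypes.
sizes = {
    "C int": ctypes.sizeof(ctypes.c_int),
    "C long": ctypes.sizeof(ctypes.c_long),
    "dtype='int32'": np.dtype("int32").itemsize,
    "dtype='int'": np.dtype(int).itemsize,
    "dtype=numpy.intc": np.dtype(np.intc).itemsize,
}

for name, nbytes in sizes.items():
    print("%-18s %d bytes" % (name, nbytes))

# True only if an int* in C can safely point at a dtype='int' array here.
print("dtype='int' matches C int:", sizes["dtype='int'"] == sizes["C int"])
```

If the last line prints False on some platform, arrays read through an
int* would be safer declared with numpy.intc than with dtype='int'.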
* I wrote the API documentation without any use of markup (epydoc or
the like). This is so that help(function) still provides human-readable
documentation. I noticed some discussion about reformatting Scipy's
docstrings to use the epydoc mark-up. A few weeks ago, I tried
contacting the author of epydoc to ask whether epydoc supports rendering
mark-up into human-readable ASCII text, and whether there were any plans
to enable such renderings when invoking Python's help() function. He has
yet to respond. I'm reluctant to use epydoc until I have assurance that
console-based help with human-readable text rendering is supported.
Thoughts on this?
* The tests I'm writing require some data files to run. What is the
convention for storing and retrieving data files when running a Scipy
regression test? Presumably the test programs should be able to find the
data files without regard to whether the data files are stored in
/usr/share or in the src directory. One solution is to embed the data in
the testing programs themselves but this is messy, and I'd like to know
if there is a better solution.
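
One possible approach, sketched here as an assumption and not as an
established Scipy convention, is to resolve data files relative to the
test module itself, so the lookup works wherever the tests are installed.
The helper name data_path and the "data" subdirectory are hypothetical.

```python
import os


def data_path(filename, test_module_file):
    """Return the path of a data file stored beside a test module.

    test_module_file is normally the test module's own __file__; the
    "data" subdirectory name is an assumption made for this sketch.
    """
    test_dir = os.path.dirname(os.path.abspath(test_module_file))
    return os.path.join(test_dir, "data", filename)


# Inside a test module this would be called as:
#     path = data_path("linkage-input.txt", __file__)
print(data_path("linkage-input.txt", os.path.join("tests", "test_agglom.py")))
```

The file names in the example are placeholders. The lookup depends only
on where the test module lives, not on /usr/share or the src directory.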
* The vector quantization/k-means Scipy package is already called
cluster so there is a naming conflict. If the hierarchical clustering
package is integrated into Scipy, I could rename it "agglom" or
"hierarchical", and have it sit in the cluster package directory.
Cheers,
Damian Eads
http://www.soe.ucsc.edu/~eads