[Scipy-tickets] [SciPy] #612: New, "better" cluster package
SciPy
scipy-tickets@scipy....
Mon Feb 25 18:41:07 CST 2008
#612: New, "better" cluster package
---------------------------+------------------------------------------------
Reporter: rspringuel | Owner: somebody
Type: enhancement | Status: new
Priority: normal | Milestone: 0.7
Component: scipy.cluster | Version:
Severity: normal | Keywords:
---------------------------+------------------------------------------------
The current cluster package for scipy implements only k-means clustering
with a Euclidean distance metric and with centroids computed as the mean
of the members of each cluster.
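For readers unfamiliar with that combination, here is a minimal
pure-Python sketch of k-means with a Euclidean metric and mean centroids.
It is not the scipy.cluster implementation; the function names
(euclidean, mean_centroid, kmeans) and the seed and iterations parameters
are made up for illustration.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_centroid(points):
    """Coordinate-wise arithmetic mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd-style k-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn from the data
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean_centroid(members)
    return centroids, labels
```

Swapping the distance or the centroid function in this loop is the kind
of generalisation the rest of this ticket is about.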
Pycluster implements eight different distance metrics, three different
centroid methods (mean, median, and medoid) and four different clustering
algorithms (k-means, agglomerative hierarchical, SOM, and PCA), but it
contains elements whose licenses are not compatible with a BSD-style
license and so cannot be incorporated into scipy.
Using Pycluster as a model, I have written a clustering package from
scratch that duplicates most of its functionality and expands on it in
certain areas as well.
I have also endeavored to design the package to make future expansions
easy and straightforward.
To date what I have written supports the following:
Distances:
euclidean
normalized euclidean
city block (aka manhattan)
normalized city block
hamming (aka simple matching coefficient)
pearson
absolute pearson
uncentered pearson
arccosine of pearson
absolute uncentered pearson
spearman
kendall
modified simple matching coefficient of Rogers and Tanimoto
modified simple matching coefficient of Sokal and Sneath
jaccard coefficient
modified jaccard coefficient of Dice
modified jaccard coefficient of Sokal and Sneath
general Minkowski metric
Chebyshev distance
Centroids:
arithmetic mean
median
absolute mean
geometric mean
harmonic mean
quadratic mean
medoid (using any of the above distances)
Clustering algorithms:
k-means
c-means (fuzzy clustering)
agglomerative hierarchical
Additional distances, centroid methods, and clustering algorithms may be
added as my work requires them.
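To illustrate how a few of the listed distances relate to each other and
how a medoid can be computed with any of them, here is a short sketch in
plain Python. It is not the code attached to this ticket: the function
names and signatures are hypothetical, and normalising the hamming
distance to a fraction is an assumption.

```python
def minkowski(a, b, p=2.0):
    """General Minkowski distance; p=1 is city block, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def city_block(a, b):
    """Sum of absolute coordinate differences (Manhattan distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    """Largest absolute coordinate difference (Minkowski as p grows)."""
    return max(abs(x - y) for x, y in zip(a, b))


def hamming(a, b):
    """Fraction of positions at which the two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def medoid(points, distance):
    """Member of `points` with the smallest total distance to the others."""
    return min(points, key=lambda c: sum(distance(c, q) for q in points))
```

Because medoid takes the distance as an argument, adding a new metric
only requires writing one more two-argument function, for example
medoid(points, chebyshev).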
Since I have written all of this code from scratch, I control the license
to it and have elected to release it under a BSD-style license so that it
can be incorporated into scipy.
What I have written obviously greatly expands on the functionality of the
current cluster package, but it is written entirely in Python and so may
be slower than what is currently present where the functionality overlaps
(hence the quotation marks around "better" in the title of this ticket).
Note: All of the centroid methods are based on my statistical functions
submitted in ticket #604 and so the code would have to be revised should
those stats functions not be incorporated into scipy.
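As a rough illustration of centroids built on per-coordinate statistics,
the sketch below uses Python's standard statistics module in place of the
functions from ticket #604, which are not reproduced here. The names
quadratic_mean, CENTROID_METHODS, and centroid are hypothetical, and the
absolute mean is left out because its exact definition is not given in
this ticket.

```python
import math
import statistics


def quadratic_mean(values):
    """Root mean square of the values."""
    return math.sqrt(statistics.fmean([v * v for v in values]))


# Map a method name to a function that reduces one coordinate's values.
CENTROID_METHODS = {
    "arithmetic": statistics.fmean,
    "median": statistics.median,
    "geometric": statistics.geometric_mean,  # requires positive values
    "harmonic": statistics.harmonic_mean,    # requires non-negative values
    "quadratic": quadratic_mean,
}


def centroid(points, method="arithmetic"):
    """Apply the chosen statistic to each coordinate of the points."""
    stat = CENTROID_METHODS[method]
    return tuple(stat(list(coords)) for coords in zip(*points))
```

If the underlying statistics functions change, only the entries in the
lookup table need to be revised.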
--
Ticket URL: <http://scipy.org/scipy/scipy/ticket/612>
SciPy <http://www.scipy.org/>
SciPy is open-source software for mathematics, science, and engineering.