[Scipysvn] r6884  trunk/scipy/cluster
scipysvn@scip...
scipysvn@scip...
Sun Nov 14 03:58:41 CST 2010
Author: rgommers
Date: 20101114 03:58:41 0600 (Sun, 14 Nov 2010)
New Revision: 6884
Modified:
trunk/scipy/cluster/__init__.py
trunk/scipy/cluster/hierarchy.py
trunk/scipy/cluster/vq.py
Log:
DOC: merge wiki edits for cluster module.
Modified: trunk/scipy/cluster/__init__.py
===================================================================
 trunk/scipy/cluster/__init__.py 20101114 09:58:13 UTC (rev 6883)
+++ trunk/scipy/cluster/__init__.py 20101114 09:58:41 UTC (rev 6884)
@@ 1,3 +1,20 @@
+"""
+Vector Quantization / Kmeans
+============================
+Clustering algorithms are useful in information theory, target detection,
+communications, compression, and other areas. The `vq` module only
+supports vector quantization and the kmeans algorithms. Development of
+selforganizing maps (SOM) and other approaches is underway.
+
+Hierarchical Clustering
+=======================
+The `hierarchy` module provides functions for hierarchical and
+agglomerative clustering. Its features include generating hierarchical
+clusters from distance matrices, computing distance matrices from
+observation vectors, calculating statistics on clusters, cutting linkages
+to generate flat clusters, and visualizing clusters with dendrograms.
+
+"""
#
# spatial  Distances
#
Modified: trunk/scipy/cluster/hierarchy.py
===================================================================
 trunk/scipy/cluster/hierarchy.py 20101114 09:58:13 UTC (rev 6883)
+++ trunk/scipy/cluster/hierarchy.py 20101114 09:58:41 UTC (rev 6884)
@@ 6,109 +6,75 @@
or find the roots of the forest formed by a cut by providing the flat
cluster ids of each observation.
+++
*Function*  *Description* 
+++
fcluster forms flat clusters from hierarchical clusters. 
+++
fclusterdata forms flat clusters directly from data. 
+++
leaders singleton root nodes for flat cluster. 
+++
+.. autosummary::
+ :toctree: generated/
+ fcluster
+ fclusterdata
+ leaders
+
These are routines for agglomerative clustering.
+++
*Function*  *Description* 
+++
linkage agglomeratively clusters original observations. 
+++
single the single/min/nearest algorithm. (alias) 
+++
complete the complete/max/farthest algorithm. (alias) 
+++
average the average/UPGMA algorithm. (alias) 
+++
weighted the weighted/WPGMA algorithm. (alias) 
+++
centroid the centroid/UPGMC algorithm. (alias) 
+++
median the median/WPGMC algorithm. (alias) 
+++
ward the Ward/incremental algorithm. (alias) 
+++
+.. autosummary::
+ :toctree: generated/
+ linkage
+ single
+ complete
+ average
+ weighted
+ centroid
+ median
+ ward
+
These routines compute statistics on hierarchies.
+++
*Function*  *Description* 
+++
cophenet computes the cophenetic distance between leaves. 
+++
from_mlab_linkage converts a linkage produced by MATLAB(TM). 
+++
inconsistent the inconsistency coefficients for cluster. 
+++
maxinconsts the maximum inconsistency coefficient for each 
 cluster. 
+++
maxdists the maximum distance for each cluster. 
+++
maxRstat the maximum specific statistic for each cluster. 
+++
to_mlab_linkage converts a linkage to one MATLAB(TM) can 
 understand. 
+++
+.. autosummary::
+ :toctree: generated/
+ cophenet
+ from_mlab_linkage
+ inconsistent
+ maxinconsts
+ maxdists
+ maxRstat
+ to_mlab_linkage
+
Routines for visualizing flat clusters.
+++
*Function*  *Description* 
+++
dendrogram visualizes linkages (requires matplotlib). 
+++
+.. autosummary::
+ :toctree: generated/
+ dendrogram
+
These are data structures and routines for representing hierarchies as
tree objects.
+++
*Function*  *Description* 
+++
ClusterNode represents cluster nodes in a cluster hierarchy. 
+++
leaves_list a lefttoright traversal of the leaves. 
+++
to_tree represents a linkage matrix as a tree object. 
+++
+.. autosummary::
+ :toctree: generated/
+ ClusterNode
+ leaves_list
+ to_tree
+
These are predicates for checking the validity of linkage and
inconsistency matrices as well as for checking isomorphism of two
flat cluster assignments.
+++
*Function*  *Description* 
+++
is_valid_im checks for a valid inconsistency matrix. 
+++
is_valid_linkage checks for a valid hierarchical clustering. 
+++
is_isomorphic checks if two flat clusterings are isomorphic. 
+++
is_monotonic checks if a linkage is monotonic. 
+++
correspond checks whether a condensed distance matrix 
 corresponds with a linkage 
+++
num_obs_linkage the number of observations corresponding to a 
 linkage matrix. 
+++
+.. autosummary::
+ :toctree: generated/
+ is_valid_im
+ is_valid_linkage
+ is_isomorphic
+ is_monotonic
+ correspond
+ num_obs_linkage
* MATLAB and MathWorks are registered trademarks of The MathWorks, Inc.
* Mathematica is a registered trademark of The Wolfram Research, Inc.

References

@@ 1332,38 +1298,40 @@
def fcluster(Z, t, criterion='inconsistent', depth=2, R=None, monocrit=None):
"""
Forms flat clusters from the hierarchical clustering defined by
 the linkage matrix ``Z``. The threshold ``t`` is a required parameter.
+ the linkage matrix ``Z``.
 :Arguments:
+ Parameters
+ 
+ Z : ndarray
+ The hierarchical clustering encoded with the matrix returned
+ by the `linkage` function.
+ t : float
+ The threshold to apply when forming flat clusters.
+ criterion : str, optional
+ The criterion to use in forming flat clusters. This can
+ be any of the following values:
  Z : ndarray
 The hierarchical clustering encoded with the matrix returned
 by the ``linkage`` function.

  t : double
 The threshold to apply when forming flat clusters.

  criterion : string (optional)
 The criterion to use in forming flat clusters. This can
 be any of the following values:

 * 'inconsistent': If a cluster node and all its
 decendents have an inconsistent value less than or equal
 to ``t`` then all its leaf descendents belong to the
+ 'inconsistent':
+ If a cluster node and all its
+ descendants have an inconsistent value less than or equal
+ to ``t`` then all its leaf descendants belong to the
same flat cluster. When no nonsingleton cluster meets
this criterion, every node is assigned to its own
cluster. (Default)
 * 'distance': Forms flat clusters so that the original
+ 'distance':
+ Forms flat clusters so that the original
observations in each flat cluster have no greater a
cophenetic distance than ``t``.
 * 'maxclust': Finds a minimum threshold ``r`` so that
+ 'maxclust':
+ Finds a minimum threshold ``r`` so that
the cophenetic distance between any two original
observations in the same flat cluster is no more than
``r`` and no more than ``t`` flat clusters are formed.
 * 'monocrit': Forms a flat cluster from a cluster node c
+ 'monocrit':
+ Forms a flat cluster from a cluster node c
with index i when ``monocrit[j] <= t``.
For example, to threshold on the maximum mean distance
@@ 1373,38 +1341,38 @@
MR = maxRstat(Z, R, 3)
cluster(Z, t=0.8, criterion='monocrit', monocrit=MR)
 * 'maxclust_monocrit': Forms a flat cluster from a
+ 'maxclust_monocrit':
+ Forms a flat cluster from a
nonsingleton cluster node ``c`` when ``monocrit[i] <=
r`` for all cluster indices ``i`` below and including
``c``. ``r`` is minimized such that no more than ``t``
flat clusters are formed. monocrit must be
monotonic. For example, to minimize the threshold t on
maximum inconsistency values so that no more than 3 flat
 clusters are formed, do:
+ clusters are formed, do::
MI = maxinconsts(Z, R)
cluster(Z, t=3, criterion='maxclust_monocrit', monocrit=MI)
  depth : int (optional)
 The maximum depth to perform the inconsistency calculation.
 It has no meaning for the other criteria. (default=2)
+ depth : int, optional
+ The maximum depth to perform the inconsistency calculation.
+ It has no meaning for the other criteria. Default is 2.
+ R : ndarray, optional
+ The inconsistency matrix to use for the 'inconsistent'
+ criterion. This matrix is computed if not provided.
+ monocrit : ndarray, optional
+ An array of length n1. ``monocrit[i]`` is the
+ statistics upon which nonsingleton i is thresholded. The
+ monocrit vector must be monotonic, i.e. given a node c with
+ index i, for all node indices j corresponding to nodes
+ below c, ``monocrit[i] >= monocrit[j]``.
  R : ndarray (optional)
 The inconsistency matrix to use for the 'inconsistent'
 criterion. This matrix is computed if not provided.
+ Returns
+ 
+ fcluster : ndarray
+ An array of length n. T[i] is the flat cluster number to
+ which original observation i belongs.
  monocrit : ndarray (optional)
 A ``(n1)`` numpy vector of doubles. ``monocrit[i]`` is the
 statistics upon which nonsingleton ``i`` is thresholded. The
 monocrit vector must be monotonic, i.e. given a node ``c`` with
 index ``i``, for all node indices j corresponding to nodes
 below ``c``, ``monocrit[i] >= monocrit[j]``.

 :Returns:

  T : ndarray
 A vector of length ``n``. ``T[i]`` is the flat cluster number to
 which original observation ``i`` belongs.
"""
Z = np.asarray(Z, order='c')
is_valid_linkage(Z, throw=True, name='Z')
@@ 1444,67 +1412,56 @@
def fclusterdata(X, t, criterion='inconsistent', \
metric='euclidean', depth=2, method='single', R=None):
"""
 ``T = fclusterdata(X, t)``
+ Cluster observation data using a given metric.
 Clusters the original observations in the ``n`` by ``m`` data
 matrix ``X`` (``n`` observations in ``m`` dimensions), using the
 euclidean distance metric to calculate distances between original
 observations, performs hierarchical clustering using the single
 linkage algorithm, and forms flat clusters using the inconsistency
 method with t as the cutoff threshold.
+ Clusters the original observations in the nbym data
+ matrix X (n observations in m dimensions), using the euclidean
+ distance metric to calculate distances between original observations,
+ performs hierarchical clustering using the single linkage algorithm,
+ and forms flat clusters using the inconsistency method with `t` as the
+ cutoff threshold.
 A onedimensional numpy array ``T`` of length ``n`` is
 returned. ``T[i]`` is the index of the flat cluster to which the
 original observation ``i`` belongs.
+ A onedimensional array T of length n is returned. T[i] is the index
+ of the flat cluster to which the original observation i belongs.
 :Arguments:
+ Parameters
+ 
+ X : ndarray
+ n by m data matrix with n observations in m dimensions.
+ t : float
+ The threshold to apply when forming flat clusters.
+ criterion : str, optional
+ Specifies the criterion for forming flat clusters. Valid
+ values are 'inconsistent' (default), 'distance', or 'maxclust'
+ cluster formation algorithms. See `fcluster` for descriptions.
+ method : str, optional
+ The linkage method to use (single, complete, average,
+ weighted, median centroid, ward). See `linkage` for more
+ information. Default is "single".
+ metric : str, optional
+ The distance metric for calculating pairwise distances. See
+ `distance.pdist` for descriptions and linkage to verify
+ compatibility with the linkage method.
+ t : double, optional
+ The cutoff threshold for the cluster function or the
+ maximum number of clusters (criterion='maxclust').
+ depth : int, optional
+ The maximum depth for the inconsistency calculation. See
+ `inconsistent` for more information.
+ R : ndarray, optional
+ The inconsistency matrix. It will be computed if necessary
+ if it is not passed.
  X : ndarray
 ``n`` by ``m`` data matrix with ``n`` observations in ``m``
 dimensions.
+ Returns
+ 
+ T : ndarray
+ A vector of length n. T[i] is the flat cluster number to
+ which original observation i belongs.
  t : double
 The threshold to apply when forming flat clusters.

  criterion : string
 Specifies the criterion for forming flat clusters. Valid
 values are 'inconsistent', 'distance', or 'maxclust' cluster
 formation algorithms. See ``fcluster`` for descriptions.

  method : string
 The linkage method to use (single, complete, average,
 weighted, median centroid, ward). See ``linkage`` for more
 information.

  metric : string
 The distance metric for calculating pairwise distances. See
 distance.pdist for descriptions and linkage to verify
 compatibility with the linkage method.

  t : double
 The cutoff threshold for the cluster function or the
 maximum number of clusters (criterion='maxclust').

  depth : int
 The maximum depth for the inconsistency calculation. See
 ``inconsistent`` for more information.

  R : ndarray
 The inconsistency matrix. It will be computed if necessary
 if it is not passed.


 :Returns:

  T : ndarray
 A vector of length ``n``. ``T[i]`` is the flat cluster number to
 which original observation ``i`` belongs.

Notes

+ This function is similar to the MATLAB function clusterdata.
 This function is similar to MATLAB(TM) clusterdata function.

"""
X = np.asarray(X, order='c', dtype=np.double)
Modified: trunk/scipy/cluster/vq.py
===================================================================
 trunk/scipy/cluster/vq.py 20101114 09:58:13 UTC (rev 6883)
+++ trunk/scipy/cluster/vq.py 20101114 09:58:41 UTC (rev 6884)
@@ 1,71 +1,77 @@
""" Kmeans Clustering and Vector Quantization Module
+"""
+Kmeans Clustering and Vector Quantization Module
 Provides routines for kmeans clustering, generating code books
 from kmeans models, and quantizing vectors by comparing them with
 centroids in a code book.
+Provides routines for kmeans clustering, generating code books
+from kmeans models, and quantizing vectors by comparing them with
+centroids in a code book.
 The kmeans algorithm takes as input the number of clusters to
 generate, k, and a set of observation vectors to cluster. It
 returns a set of centroids, one for each of the k clusters. An
 observation vector is classified with the cluster number or
 centroid index of the centroid closest to it.
+The kmeans algorithm takes as input the number of clusters to
+generate, k, and a set of observation vectors to cluster. It
+returns a set of centroids, one for each of the k clusters. An
+observation vector is classified with the cluster number or
+centroid index of the centroid closest to it.
 A vector v belongs to cluster i if it is closer to centroid i than
 any other centroids. If v belongs to i, we say centroid i is the
 dominating centroid of v. Common variants of kmeans try to
 minimize distortion, which is defined as the sum of the distances
 between each observation vector and its dominating centroid. Each
 step of the kmeans algorithm refines the choices of centroids to
 reduce distortion. The change in distortion is often used as a
 stopping criterion: when the change is lower than a threshold, the
 kmeans algorithm is not making sufficient progress and
 terminates.
+A vector v belongs to cluster i if it is closer to centroid i than
+any other centroids. If v belongs to i, we say centroid i is the
+dominating centroid of v. The kmeans algorithm tries to
+minimize distortion, which is defined as the sum of the squared distances
+between each observation vector and its dominating centroid. Each
+step of the kmeans algorithm refines the choices of centroids to
+reduce distortion. The change in distortion is used as a
+stopping criterion: when the change is lower than a threshold, the
+kmeans algorithm is not making sufficient progress and
+terminates. One can also define a maximum number of iterations.
 Since vector quantization is a natural application for kmeans,
 information theory terminology is often used. The centroid index
 or cluster index is also referred to as a "code" and the table
 mapping codes to centroids and vice versa is often referred as a
 "code book". The result of kmeans, a set of centroids, can be
 used to quantize vectors. Quantization aims to find an encoding of
 vectors that reduces the expected distortion.
+Since vector quantization is a natural application for kmeans,
+information theory terminology is often used. The centroid index
+or cluster index is also referred to as a "code" and the table
+mapping codes to centroids and vice versa is often referred as a
+"code book". The result of kmeans, a set of centroids, can be
+used to quantize vectors. Quantization aims to find an encoding of
+vectors that reduces the expected distortion.
 For example, suppose we wish to compress a 24bit color image
 (each pixel is represented by one byte for red, one for blue, and
 one for green) before sending it over the web. By using a smaller
 8bit encoding, we can reduce the amount of data by two
 thirds. Ideally, the colors for each of the 256 possible 8bit
 encoding values should be chosen to minimize distortion of the
 color. Running kmeans with k=256 generates a code book of 256
 codes, which fills up all possible 8bit sequences. Instead of
 sending a 3byte value for each pixel, the 8bit centroid index
 (or code word) of the dominating centroid is transmitted. The code
 book is also sent over the wire so each 8bit code can be
 translated back to a 24bit pixel value representation. If the
 image of interest was of an ocean, we would expect many 24bit
 blues to be represented by 8bit codes. If it was an image of a
 human face, more flesh tone colors would be represented in the
 code book.
+All routines expect obs to be a M by N array where the rows are
+the observation vectors. The codebook is a k by N array where the
+i'th row is the centroid of code word i. The observation vectors
+and centroids have the same feature dimension.
 All routines expect obs to be a M by N array where the rows are
 the observation vectors. The codebook is a k by N array where the
 i'th row is the centroid of code word i. The observation vectors
 and centroids have the same feature dimension.
+As an example, suppose we wish to compress a 24bit color image
+(each pixel is represented by one byte for red, one for blue, and
+one for green) before sending it over the web. By using a smaller
+8bit encoding, we can reduce the amount of data by two
+thirds. Ideally, the colors for each of the 256 possible 8bit
+encoding values should be chosen to minimize distortion of the
+color. Running kmeans with k=256 generates a code book of 256
+codes, which fills up all possible 8bit sequences. Instead of
+sending a 3byte value for each pixel, the 8bit centroid index
+(or code word) of the dominating centroid is transmitted. The code
+book is also sent over the wire so each 8bit code can be
+translated back to a 24bit pixel value representation. If the
+image of interest was of an ocean, we would expect many 24bit
+blues to be represented by 8bit codes. If it was an image of a
+human face, more flesh tone colors would be represented in the
+code book.
 whiten(obs) 
 Normalize a group of observations so each feature has unit
 variance.
 vq(obs,code_book) 
 Calculate code book membership of a set of observation
 vectors.
 kmeans(obs,k_or_guess,iter=20,thresh=1e5) 
 Clusters a set of observation vectors. Learns centroids with
 the kmeans algorithm, trying to minimize distortion. A code
 book is generated that can be used to quantize vectors.
 kmeans2 
 A different implementation of kmeans with more methods for
 initializing centroids. Uses maximum number of iterations as
 opposed to a distortion threshold as its stopping criterion.
+Functions
+
+`whiten` :
+ Normalize a group of observations so each feature has unit
+ variance.
+`vq` :
+ Calculate code book membership of a set of observation
+ vectors.
+
+`kmeans` :
+ Clusters a set of observation vectors. Learns centroids with
+ the kmeans algorithm, trying to minimize distortion. A code
+ book is generated that can be used to quantize vectors.
+
+`kmeans2` :
+ A different implementation of kmeans with more methods for
+ initializing centroids. Uses maximum number of iterations as
+ opposed to a distortion threshold as its stopping criterion.
+
"""
__docformat__ = 'restructuredtext'
@@ 393,62 +399,68 @@
return code_book, avg_dist[1]
def kmeans(obs, k_or_guess, iter=20, thresh=1e5):
 """Performs kmeans on a set of observation vectors forming k
 clusters. This yields a code book mapping centroids to codes
 and vice versa. The kmeans algorithm adjusts the centroids
 until sufficient progress cannot be made, i.e. the change in
 distortion since the last iteration is less than some
 threshold.
+ """
+ Performs kmeans on a set of observation vectors forming k clusters.
 :Parameters:
 obs : ndarray
 Each row of the M by N array is an observation vector. The
 columns are the features seen during each observation.
 The features must be whitened first with the whiten
 function.
+ The kmeans algorithm adjusts the centroids until sufficient
+ progress cannot be made, i.e. the change in distortion since
+ the last iteration is less than some threshold. This yields
+ a code book mapping centroids to codes and vice versa.
 k_or_guess : int or ndarray
 The number of centroids to generate. A code is assigned to
 each centroid, which is also the row index of the centroid
 in the code_book matrix generated.
+ Distortion is defined as the sum of the squared differences
+ between the observations and the corresponding centroid.
 The initial k centroids are chosen by randomly selecting
 observations from the observation matrix. Alternatively,
 passing a k by N array specifies the initial k centroids.
+ Parameters
+ 
+ obs : ndarray
+ Each row of the M by N array is an observation vector. The
+ columns are the features seen during each observation.
+ The features must be whitened first with the `whiten` function.
 iter : int
 The number of times to run kmeans, returning the codebook
 with the lowest distortion. This argument is ignored if
 initial centroids are specified with an array for the
 k_or_guess paramter. This parameter does not represent the
 number of iterations of the kmeans algorithm.
+ k_or_guess : int or ndarray
+ The number of centroids to generate. A code is assigned to
+ each centroid, which is also the row index of the centroid
+ in the code_book matrix generated.
 thresh : float
 Terminates the kmeans algorithm if the change in
 distortion since the last kmeans iteration is less than
 thresh.
+ The initial k centroids are chosen by randomly selecting
+ observations from the observation matrix. Alternatively,
+ passing a k by N array specifies the initial k centroids.
 :Returns:
 codebook : ndarray
 A k by N array of k centroids. The i'th centroid
 codebook[i] is represented with the code i. The centroids
 and codes generated represent the lowest distortion seen,
 not necessarily the globally minimal distortion.
+ iter : int, optional
+ The number of times to run kmeans, returning the codebook
+ with the lowest distortion. This argument is ignored if
+ initial centroids are specified with an array for the
+ ``k_or_guess`` parameter. This parameter does not represent the
+ number of iterations of the kmeans algorithm.
 distortion : float
 The distortion between the observations passed and the
 centroids generated.
+ thresh : float, optional
+ Terminates the kmeans algorithm if the change in
+ distortion since the last kmeans iteration is less than
+ or equal to thresh.
 :SeeAlso:
  kmeans2: a different implementation of kmeans clustering
 with more methods for generating initial centroids but without
 using a distortion change threshold as a stopping criterion.
  whiten: must be called prior to passing an observation matrix
 to kmeans.
+ Returns
+ 
+ codebook : ndarray
+ A k by N array of k centroids. The i'th centroid
+ codebook[i] is represented with the code i. The centroids
+ and codes generated represent the lowest distortion seen,
+ not necessarily the globally minimal distortion.
+ distortion : float
+ The distortion between the observations passed and the
+ centroids generated.
+
+ See Also
+ 
+ kmeans2 : a different implementation of kmeans clustering
+ with more methods for generating initial centroids but without
+ using a distortion change threshold as a stopping criterion.
+
+ whiten : must be called prior to passing an observation matrix
+ to kmeans.
+
Examples


>>> from numpy import array
>>> from scipy.cluster.vq import vq, kmeans, whiten
>>> features = array([[ 1.9,2.3],
More information about the Scipysvn
mailing list