[Scipy-svn] r4186 - trunk/scipy/cluster

scipy-svn@scip...
Sun Apr 27 07:18:50 CDT 2008


Author: damian.eads
Date: 2008-04-27 07:18:48 -0500 (Sun, 27 Apr 2008)
New Revision: 4186

Modified:
   trunk/scipy/cluster/vq.py
Log:
Cleaned up documentation. Removed certain references to optimal/minimal distortion solutions, which cannot be claimed. Made terminology more consistent. Improved grammar and usage. Wrote summary of k-means and quantization.

Modified: trunk/scipy/cluster/vq.py
===================================================================
--- trunk/scipy/cluster/vq.py	2008-04-27 09:39:47 UTC (rev 4185)
+++ trunk/scipy/cluster/vq.py	2008-04-27 12:18:48 UTC (rev 4186)
@@ -1,18 +1,60 @@
-""" Vector Quantization Module
+""" K-means Clustering and Vector Quantization Module
 
-    Provides several routines used in creating a code book from a set of
-    observations and comparing a set of observations to a code book.
+    Provides routines for performing k-means clustering and vector
+    quantization.
 
+    The k-means algorithm takes as input the number of clusters to
+    generate k and a set of observation vectors to cluster.  It
+    returns as its model a set of centroids, one for each of the k
+    clusters.  An observation vector is classified with the cluster
+    number or centroid index of the centroid closest to it. The
+    cluster is defined as the set of all points closest to the
+    centroid of the cluster.
+
+    Since vector quantization is a natural application for k-means,
+    and vector quantization is often a subject of information theory,
+    the terminology of the two fields is often used in describing
+    k-means.  The centroid or cluster index is often referred to as
+    a "code" and the mapping table from codes to centroids is often
+    referred to as a "code book".
+
+    The result of k-means, a set of centroids, is often used to
+    quantize vectors. Quantization aims to find an encoding that
+    reduces information loss or distortion. The centroids represent
+    the center of mass of the clusters they define. Each step of
+    the k-means algorithm refines the choices of centroids to
+    reduce distortion.  When change in distortion is lower than
+    a threshold, the k-means algorithm has converged.
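As an illustrative aside (an editor's sketch, not part of the patch), the assignment/update loop described above can be written in plain NumPy; the toy data and the two-cluster setup are invented for the example:

```python
import numpy as np

# Toy observation set: two loose groups in the plane.
obs = np.array([[0.0, 0.0], [0.5, 0.2], [4.0, 4.0], [4.5, 3.8]])
centroids = obs[[0, 2]].copy()  # initial guess: two of the observations

for _ in range(10):
    # Assignment step: code = index of the nearest centroid.
    dists = np.linalg.norm(obs[:, None, :] - centroids[None, :, :], axis=2)
    codes = dists.argmin(axis=1)
    # Update step: move each centroid to the mean of its cluster.
    centroids = np.array([obs[codes == j].mean(axis=0) for j in range(2)])
```

A full implementation would stop once the decrease in total distortion falls below the threshold instead of running a fixed number of passes.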
+
+    For example, suppose we wish to compress a 24-bit per pixel color
+    image before sending it over the web. Each pixel value is
+    represented by three bytes, one each for red, green, and blue. By
+    using a smaller 8-bit encoding, we can reduce the data to send by
+    two thirds. Ideally, the colors for each of the 256 possible 8-bit
+    encoding values should be chosen to minimize distortion of the
+    color. By running k-means with k=256, we generate a code book of
+    256 codes, one for every 8-bit sequence.  Instead of sending a
+    3-byte value for each pixel, the centroid index (or code word) of
+    the centroid closest to it is transmitted. The code book is
+    also sent over the wire so each received pixel value, represented
+    as a centroid index, can be translated back into its 24-bit
+    representation.
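To make the two-thirds figure concrete, here is a back-of-the-envelope calculation (the 640x480 image size is an invented example):

```python
n_pixels = 640 * 480
raw_bytes = 3 * n_pixels               # 24 bits = 3 bytes per pixel
coded_bytes = n_pixels + 256 * 3       # one code byte per pixel + the code book
savings = 1 - coded_bytes / raw_bytes  # just under two thirds
```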
+
+    This module provides routines for k-means clustering, generating
+    code books from k-means, and quantizing vectors by comparing
+    them to centroids in a code book.
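A minimal end-to-end sketch of that pipeline, using the module's own whiten, kmeans, and vq routines (the toy data is invented for illustration):

```python
import numpy as np
from scipy.cluster.vq import whiten, kmeans, vq

np.random.seed(0)  # kmeans selects random initial centroids
obs = np.array([[1.0, 1.0], [1.1, 0.9],
                [8.0, 8.0], [8.2, 7.9]])
w = whiten(obs)                       # rescale each feature to unit variance
code_book, distortion = kmeans(w, 2)  # fit two centroids
codes, dist = vq(w, code_book)        # assign each observation a code
```

The two observations in each group end up with the same code, and the two groups receive different codes.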
+
     All routines expect an "observation vector" to be stored in each
-    row of the obs matrix.  Similarly the codes are stored row wise
-    in the code book matrix.
+    row of the obs matrix.  Similarly the centroids corresponding to
+    the codes are stored as rows of the code_book matrix. The code i
+    corresponds to the centroid stored in code_book[i].
 
     whiten(obs) --
-        Normalize a group of observations on a per feature basis
+        Normalize a group of observations so each feature has unit variance.
     vq(obs,code_book) --
-        Calculate code book membership of obs
+        Calculate code book membership of obs.
     kmeans(obs,k_or_guess,iter=20,thresh=1e-5) --
-        Train a codebook for mimimum distortion using the kmeans algorithm
+        Train a codebook for minimum distortion using the k-means algorithm.
     kmeans2
         Similar to kmeans, but with several initialization methods.
 
@@ -22,7 +64,7 @@
 __all__ = ['whiten', 'vq', 'kmeans', 'kmeans2']
 
 # TODO:
-#   - implements high level method for running several times kmeans with
+#   - implement a high-level method for running k-means several times with
#   different initializations
 #   - warning: what happens if different number of clusters ? For now, emit a
 #   warning, but it is not great, because I am not sure it really make sense to
@@ -41,15 +83,15 @@
 def whiten(obs):
     """ Normalize a group of observations on a per feature basis.
 
-    Before running kmeans algorithms, it is beneficial to "whiten", or
-    scale, the observation data on a per feature basis.  This is done
-    by dividing each feature by its standard deviation across all
-    observations.
+    Before running k-means, it is beneficial to rescale each feature
+    dimension of the observation set with whitening. Each feature is
+    divided by its standard deviation across all observations to give
+    it unit variance.
 
     :Parameters:
         obs : ndarray
             Each row of the array is an observation.  The
-            columns are the "features" seen during each observation
+            columns are the features seen during each observation.
             ::
 
                       #   f0    f1    f2
@@ -83,23 +125,25 @@
     return obs / std_dev
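For illustration (an editor's sketch, not part of the patch), the same normalization in plain NumPy, showing that each feature ends up with unit variance:

```python
import numpy as np

obs = np.array([[1.9, 2.3], [1.5, 2.5], [0.8, 0.6], [0.4, 1.8]])
whitened = obs / obs.std(axis=0)  # divide each feature by its std dev
# every column of `whitened` now has standard deviation 1
```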
 
 def vq(obs, code_book):
-    """ Vector Quantization: assign features sets to codes in a code book.
+    """ Vector Quantization: assign codes from a code book to observations.
 
-    Vector quantization determines which code in the code book best represents
-    an observation of a target.  The features of each observation are compared
-    to each code in the book, and assigned the one closest to it.  The
-    observations are contained in the obs array. These features should be
-    "whitened," or nomalized by the standard deviation of all the features
-    before being quantized.  The code book can be created using the kmeans
-    algorithm or something similar.
+    Assigns a code from a code book to each observation. Each
+    observation vector in the NxM obs array is compared with the
+    centroids in the code book and assigned the code of the closest
+    centroid.
 
+    The features in obs should have unit variance, which can be
+    achieved by passing them through the whiten function.  The code
+    book can be created with the k-means algorithm or a different
+    encoding algorithm.
+
     :Parameters:
         obs : ndarray
-            Each row of the array is an observation.  The columns are the
-            "features" seen during each observation The features must be
+            Each row of the NxM array is an observation.  The columns are the
+            "features" seen during each observation. The features must be
             whitened first using the whiten function or something equivalent.
         code_book : ndarray.
-            The code book is usually generated using the kmeans algorithm.
+            The code book is usually generated using the k-means algorithm.
             Each row of the array holds a different code, and the columns are
             the features of the code.
 
@@ -112,15 +156,14 @@
 
     :Returns:
         code : ndarray
-            If obs is a NxM array, then a length N array is returned that holds
-            the selected code book index for each observation.
+            A length N array holding the code book index for each observation.
         dist : ndarray
             The distortion (distance) between the observation and its nearest
-            code
+            code.
 
     Notes
     -----
-    This currently forces 32 bit math precision for speed.  Anyone know
+    This currently forces 32-bit math precision for speed.  Anyone know
     of a situation where this undermines the accuracy of the algorithm?
 
     Examples
@@ -154,7 +197,7 @@
 def py_vq(obs, code_book):
     """ Python version of vq algorithm.
 
-    The algorithm simply computes the euclidian distance between each
+    The algorithm computes the Euclidean distance between each
     observation and every frame in the code_book.
 
     :Parameters:
@@ -165,11 +208,11 @@
             features (eg columns) than obs.
 
     :Note:
-        This function is slower than the C versions, but it works for
+        This function is slower than the C version but works for
         all input types.  If the inputs have the wrong types for the
         C versions of the function, this one is called as a last resort.
 
-        Its about 20 times slower than the C versions.
+        It is about 20 times slower than the C version.
 
     :Returns:
         code : ndarray
@@ -284,7 +327,7 @@
     return code, min_dist
 
 def _kmeans(obs, guess, thresh=1e-5):
-    """ "raw" version of kmeans.
+    """ "raw" version of k-means.
 
     :Returns:
         code_book :
@@ -294,7 +337,7 @@
             Lower means the code_book matches the data better.
 
     :SeeAlso:
-        - kmeans : wrapper around kmeans
+        - kmeans : wrapper around k-means
 
     XXX should have an axis variable here.
 
@@ -341,34 +384,55 @@
     return code_book, avg_dist[-1]
 
 def kmeans(obs, k_or_guess, iter=20, thresh=1e-5):
-    """Generate a code book with minimum distortion.
+    """Performs k-means on a set of observations for a specified number of
+       iterations. This yields a code book mapping centroids to codes
+       and vice versa. The k-means algorithm adjusts the centroids
+       until the change in distortion caused by quantizing the
+       observation is less than some threshold.
 
     :Parameters:
         obs : ndarray
-            Each row of the array is an observation.  The columns are the
-            "features" seen during each observation The features must be
-            whitened first using the whiten function or something equivalent.
+            Each row of the M by N array is an observation.  The columns are the
+            "features" seen during each observation.  The features must be
+            whitened first with the whiten function.
         k_or_guess : int or ndarray
-            If integer, it is the number of code book elements.  If a 2D array,
-            the array is used as the intial guess for the code book.  The array
-            should have k rows, and the same number of columns (features) as
-            the obs array.
+            The number of centroids to generate. One code is assigned
+            to each centroid, which is also the row index of the
+            centroid in the generated code_book matrix.
+
+            The initial k centroids will be chosen by randomly
+            selecting observations from the observation
+            matrix. Alternatively, passing a k by N array specifies
+            the initial values of the k means.
+
         iter : int
-            The number of times to restart the kmeans algorithm with a new
-            initial guess.  If k_or_guess is a 2D array (codebook), this
-            argument is ignored and only 1 iteration is run.
+            The number of times to run k-means, returning the codebook
+            with the lowest distortion. This argument is ignored if
+            initial mean values are specified with an array for the
+            k_or_guess parameter. This parameter does not represent the
+            number of iterations of the k-means algorithm.
+
         thresh : float
-            Terminate each kmeans run when the distortion change from one
-            iteration to the next is less than this value.
+            Terminates the k-means algorithm if the change in
+            distortion since the last k-means iteration is less than
+            thresh.
+
     :Returns:
-        codesbook : ndarray
-            The codes that best fit the observation
+        codebook : ndarray
+            A k by N array of k centroids. The i'th centroid
+            codebook[i] is represented with the code i. The centroids
+            and codes generated represent the lowest distortion seen,
+            not necessarily the global minimum distortion.
+
         distortion : float
-            The distortion between the observations and the codes.
+           The distortion between the observations passed and the
+           centroids generated.
 
     :SeeAlso:
         - kmeans2: similar function, but with more options for initialization,
           and returns label of each observation
+        - whiten: must be called prior to passing an observation matrix
+          to kmeans.
 
     Examples
     --------
@@ -499,46 +563,51 @@
 
 def kmeans2(data, k, iter = 10, thresh = 1e-5, minit = 'random',
         missing = 'warn'):
-    """Classify a set of points into k clusters using kmean algorithm.
+    """Classify a set of observations into k clusters using the k-means
+       algorithm.
 
-    The algorithm works by minimizing the euclidian distance between data points
-    of cluster means. This version is more complete than kmean (has several
-    initialisation methods).
+    The algorithm attempts to minimize the Euclidean distance between
+    observations and centroids. Several initialization methods are
+    included.
 
     :Parameters:
         data : ndarray
-            Expect a rank 1 or 2 array. Rank 1 are assumed to describe one
-            dimensional data, rank 2 multidimensional data, in which case one
-            row is one observation.
+            An M by N array of M observations in N dimensions or a length
+            M array of M one-dimensional observations.
         k : int or ndarray
-            Number of clusters. If minit arg is 'matrix', or if a ndarray is
-            given instead, it is interpreted as initial cluster to use instead.
+            The number of clusters to form as well as the number of
+            centroids to generate. If the minit initialization string
+            is 'matrix', or if an ndarray is given instead, it is
+            interpreted as the initial centroids to use.
         niter : int
-            Number of iterations to run.
+            Number of iterations of k-means to run. Note that this
+            differs in meaning from the iter parameter to the kmeans
+            function.
         thresh : float
             (not used yet).
         minit : string
-            Method for initialization. Available methods are random, points and
-            uniform:
+            Method for initialization. Available methods are 'random',
+            'points', 'uniform', and 'matrix':
 
-            random uses k points drawn from a Gaussian random generator which
-            mean and variances are estimated from the data.
+            'random': generate k centroids from a Gaussian with mean and
+            variance estimated from the data.
 
-            points choses k points at random from the points in data.
+            'points': choose k observations (rows) at random from data for
+            the initial centroids.
 
-            uniform choses k points from the data such are they form a uniform
-            grid od the dataset (not supported yet).
+            'uniform': generate k observations from the data from a uniform
+            distribution defined by the data set (unsupported).
 
-            matrix means that k has to be interpreted as initial clusters
-            (format is the same than data).
+            'matrix': interpret the k parameter as a k by N (or length k
+            array for one-dimensional data) array of initial centroids.
 
     :Returns:
-        clusters : ndarray
-            the found clusters (one cluster per row).
+        centroid : ndarray
+            A k by N array of centroids found at the last iteration of
+            k-means.
         label : ndarray
-            label[i] gives the label of the ith obversation, that its centroid is
-            cluster[label[i]].
-
+            label[i] is the code or index of the centroid the
+            i'th observation is closest to.
     """
     if missing not in _valid_miss_meth.keys():
         raise ValueError("Unkown missing method: %s" % str(missing))
@@ -581,7 +650,7 @@
 def _kmeans2(data, code, niter, nc, missing):
     """ "raw" version of kmeans2. Do not use directly.
 
-    Run kmeans with a given initial codebook.  """
+    Run k-means with a given initial codebook.  """
     for i in range(niter):
         # Compute the nearest neighbour for each obs
         # using the current code book


