[Scipy-svn] r4187 - trunk/scipy/cluster

scipy-svn@scip... scipy-svn@scip...
Sun Apr 27 07:48:27 CDT 2008


Author: damian.eads
Date: 2008-04-27 07:48:25 -0500 (Sun, 27 Apr 2008)
New Revision: 4187

Modified:
   trunk/scipy/cluster/vq.py
Log:
Tightening up the language of vq's module summary.

Modified: trunk/scipy/cluster/vq.py
===================================================================
--- trunk/scipy/cluster/vq.py	2008-04-27 12:18:48 UTC (rev 4186)
+++ trunk/scipy/cluster/vq.py	2008-04-27 12:48:25 UTC (rev 4187)
@@ -1,54 +1,54 @@
 """ K-means Clustering and Vector Quantization Module
 
-    Provides routines for performing k-means clustering and vector
-    quantization.
+    Provides routines for k-means clustering, generating code books
+    from k-means models, and quantizing vectors by comparing them with
+    centroids in a code book.
 
     The k-means algorithm takes as input the number of clusters to
     generate k and a set of observation vectors to cluster.  It
     returns as its model a set of centroids, one for each of the k
     clusters.  An observation vector is classified with the cluster
-    number or centroid index of the centroid closest to it. The
-    cluster is defined as the set of all points closest to the
-    centroid of the cluster.
+    number or centroid index of the centroid closest to it.
 
+    Most variants of k-means try to minimize distortion, which is
+    defined as the sum of the distances between each observation and
+    its dominating centroid. A vector belongs to a cluster i if it is
+    closer to centroid i than the other centroids. Each step of the
+    k-means algorithm refines the choices of centroids to reduce
+    distortion. The change in distortion is often used as a stopping
+    criterion: when the change is lower than a threshold, the k-means
+    algorithm is not making progress and terminates.
+
     Since vector quantization is a natural application for k-means,
-    and vector quantization is often a subject of information theory,
-    the terminology for the latter two are often used in describing
-    k-means.  The centroid or cluster index is often referred to as
-    a "code" and the mapping table from codes to centroids is often
-    referred to as a "code book".
+    information theory terminology is often used.  The centroid index
+    or cluster index is also referred to as a "code" and the table
+    mapping codes to centroids and vice versa is often referred to as a
+    "code book". The result of k-means, a set of centroids, can be
+    used to quantize vectors. Quantization aims to find an encoding of
+    vectors that reduces the expected distortion.
 
-    The result of k-means, a set of centroids, is often used to
-    quantize vectors. Quantization aims to find an encoding that
-    reduces information loss or distortion. The centroids represent
-    the center of mass of the clusters they define. Each step of
-    the k-means algorithm refines the choices of centroids to
-    reduce distortion.  When change in distortion is lower than
-    a threshold, the k-means algorithm has converged.
-
-    For example, suppose we wish to compress a 24-bit per pixel color
-    image before sending it over the web. Each pixel value is
-    represented by three bytes, one each for red, green, and blue. By
-    using a smaller 8-bit encoding, we can reduce the data to send by
-    two thirds. Ideally, the colors for each of the 256 possible 8-bit
+    For example, suppose we wish to compress a 24-bit color image
+    (each pixel is represented by one byte for red, one for blue, and
+    one for green) before sending it over the web.  By using a smaller
+    8-bit encoding, we can reduce the data to send by two
+    thirds. Ideally, the colors for each of the 256 possible 8-bit
     encoding values should be chosen to minimize distortion of the
-    color. By running k-means with k=256, we generate a code book of
-    256 codes, one for every 8-bit sequence.  Instead of sending a
-    3-byte value for each pixel, the centroid index (or code word) of
-    the centroid closest to it is is transmitted. The code book is
-    also sent over the wire so each received pixel value, represented
-    as a centroid index, can be translated back into its 24-bit
-    representation.
+    color. Running k-means with k=256 generates a code book of 256
+    codes, which fills up all possible 8-bit sequences.  Instead of
+    sending a 3-byte value for each pixel, the 8-bit centroid index
+    (or code word) of the dominating centroid is transmitted. The code
+    book is also sent over the wire so each 8-bit code can be
+    translated back to a 24-bit pixel value representation. If the
+    image of interest was of an ocean, we would expect many 24-bit
+    blues to be represented by 8-bit codes. If it was an image of a
+    human face, more flesh tone colors would be represented in the
+    code book.
 
-    This module provides routines for k-means clustering, generating
-    code books from k-means, and quantizing vectors by comparing
-    them to centroids in a code book.
+    All routines expect the observation vectors to be stored as rows
+    in the obs matrix.  Similarly the centroids corresponding to the
+    codes are stored as rows of the code_book matrix.  The i'th index
+    is the code corresponding to the code_book[i] centroid.
 
-    All routines expect an "observation vector" to be stored in each
-    row of the obs matrix.  Similarly the centroids corresponding to
-    the codes are stored as rows of the code_book matrix. The i'th
-    index is the code corresponding to the code_book[i] centroid.
-
     whiten(obs) --
         Normalize a group of observations so each feature has unit variance.
     vq(obs,code_book) --

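The stopping rule described in the revised summary (stop once the change in distortion falls below a threshold) can be illustrated with a small pure-Python sketch. This is not the code in vq.py; the names nearest, total_distortion and kmeans_sketch, and the thresh and max_iter parameters, are hypothetical and chosen only for this example. Distortion is taken here as the sum of Euclidean distances to the closest centroid, following the wording of the docstring.

```python
import math


def distance(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def nearest(vector, code_book):
    """Return (code, distance) for the centroid closest to vector."""
    dists = [distance(vector, centroid) for centroid in code_book]
    code = min(range(len(code_book)), key=dists.__getitem__)
    return code, dists[code]


def total_distortion(obs, code_book):
    """Sum of distances between each observation and its closest centroid."""
    return sum(nearest(o, code_book)[1] for o in obs)


def kmeans_sketch(obs, initial_centroids, thresh=1e-5, max_iter=100):
    """Illustrative k-means loop (assumed names, not the scipy routine)."""
    code_book = [list(c) for c in initial_centroids]
    distortion = total_distortion(obs, code_book)
    for _ in range(max_iter):
        # Assignment step: each observation gets the code of its closest centroid.
        codes = [nearest(o, code_book)[0] for o in obs]
        # Update step: move each centroid to the mean of its members.
        new_book = []
        for i, centroid in enumerate(code_book):
            members = [o for o, c in zip(obs, codes) if c == i]
            if members:
                centroid = [sum(col) / len(members) for col in zip(*members)]
            new_book.append(centroid)
        new_distortion = total_distortion(obs, new_book)
        change = distortion - new_distortion
        code_book, distortion = new_book, new_distortion
        # Stop when the change in distortion drops below the threshold.
        if change < thresh:
            break
    return code_book, distortion


obs = [[0, 0], [0, 1], [10, 10], [10, 11]]
code_book, distortion = kmeans_sketch(obs, [[1, 1], [9, 9]])
print(code_book, distortion)
```

Observations are stored as rows of obs and centroids as rows of code_book, matching the layout the docstring describes. An empty cluster simply keeps its previous centroid in this sketch; a real implementation may handle that case differently.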


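The image compression example in the docstring can be made concrete with another short sketch. It assumes a code book has already been computed; encode, decode and payload_bytes are hypothetical helper names, and the three-color code book below is made up for illustration (the docstring's example would use k=256).

```python
def encode(pixels, code_book):
    """Replace each (r, g, b) pixel with the index of the closest code book color."""
    def sqdist(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))
    return [
        min(range(len(code_book)), key=lambda i: sqdist(p, code_book[i]))
        for p in pixels
    ]


def decode(codes, code_book):
    """Translate each code back to the 24-bit color stored in the code book."""
    return [code_book[c] for c in codes]


def payload_bytes(n_pixels, k=256):
    """Bytes sent without and with quantization, assuming one byte per code."""
    raw = 3 * n_pixels              # three bytes per pixel
    encoded = n_pixels + 3 * k      # one byte per pixel plus the code book itself
    return raw, encoded


# Made-up code book: two blues and one flesh tone.
code_book = [(0, 0, 128), (0, 64, 255), (255, 224, 189)]
pixels = [(0, 10, 130), (250, 220, 190), (10, 70, 250)]

codes = encode(pixels, code_book)
print(codes)
print(decode(codes, code_book))
print(payload_bytes(640 * 480))
```

For a 640x480 image the code book adds 768 bytes, so the saving is slightly under the two thirds quoted in the docstring, and the overhead shrinks in relative terms as the image grows.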