[Scipy-svn] r4187 - trunk/scipy/cluster
scipysvn@scip...
Sun Apr 27 07:48:27 CDT 2008
Author: damian.eads
Date: 2008-04-27 07:48:25 -0500 (Sun, 27 Apr 2008)
New Revision: 4187
Modified:
trunk/scipy/cluster/vq.py
Log:
Tightening up the language of vq's module summary.
Modified: trunk/scipy/cluster/vq.py
===================================================================
--- trunk/scipy/cluster/vq.py 2008-04-27 12:18:48 UTC (rev 4186)
+++ trunk/scipy/cluster/vq.py 2008-04-27 12:48:25 UTC (rev 4187)
@@ -1,54 +1,54 @@
""" K-means Clustering and Vector Quantization Module
- Provides routines for performing k-means clustering and vector
- quantization.
+ Provides routines for k-means clustering, generating code books
+ from k-means models, and quantizing vectors by comparing them with
+ centroids in a code book.
The k-means algorithm takes as input the number of clusters to
generate k and a set of observation vectors to cluster. It
returns as its model a set of centroids, one for each of the k
clusters. An observation vector is classified with the cluster
- number or centroid index of the centroid closest to it. The
- cluster is defined as the set of all points closest to the
- centroid of the cluster.
+ number or centroid index of the centroid closest to it.
+ Most variants of k-means try to minimize distortion, which is
+ defined as the sum of the distances between each observation and
+ its dominating centroid. A vector belongs to a cluster i if it is
+ closer to centroid i than the other centroids. Each step of the
+ k-means algorithm refines the choices of centroids to reduce
+ distortion. The change in distortion is often used as a stopping
+ criterion: when the change is lower than a threshold, the k-means
+ algorithm is not making progress and terminates.
+
Since vector quantization is a natural application for k-means,
- and vector quantization is often a subject of information theory,
- the terminology for the latter two are often used in describing
- k-means. The centroid or cluster index is often referred to as
- a "code" and the mapping table from codes to centroids is often
- referred to as a "code book".
+ information theory terminology is often used. The centroid index
+ or cluster index is also referred to as a "code" and the table
+ mapping codes to centroids and vice versa is often referred to as a
+ "code book". The result of k-means, a set of centroids, can be
+ used to quantize vectors. Quantization aims to find an encoding of
+ vectors that reduces the expected distortion.
- The result of k-means, a set of centroids, is often used to
- quantize vectors. Quantization aims to find an encoding that
- reduces information loss or distortion. The centroids represent
- the center of mass of the clusters they define. Each step of
- the k-means algorithm refines the choices of centroids to
- reduce distortion. When change in distortion is lower than
- a threshold, the k-means algorithm has converged.

- For example, suppose we wish to compress a 24-bit per pixel color
- image before sending it over the web. Each pixel value is
- represented by three bytes, one each for red, green, and blue. By
- using a smaller 8-bit encoding, we can reduce the data to send by
- two thirds. Ideally, the colors for each of the 256 possible 8-bit
+ For example, suppose we wish to compress a 24-bit color image
+ (each pixel is represented by one byte for red, one for blue, and
+ one for green) before sending it over the web. By using a smaller
+ 8-bit encoding, we can reduce the data to send by two
+ thirds. Ideally, the colors for each of the 256 possible 8-bit
encoding values should be chosen to minimize distortion of the
- color. By running k-means with k=256, we generate a code book of
- 256 codes, one for every 8-bit sequence. Instead of sending a
- 3-byte value for each pixel, the centroid index (or code word) of
- the centroid closest to it is is transmitted. The code book is
- also sent over the wire so each received pixel value, represented
- as a centroid index, can be translated back into its 24-bit
- representation.
+ color. Running k-means with k=256 generates a code book of 256
+ codes, which fills up all possible 8-bit sequences. Instead of
+ sending a 3-byte value for each pixel, the 8-bit centroid index
+ (or code word) of the dominating centroid is transmitted. The code
+ book is also sent over the wire so each 8-bit code can be
+ translated back to a 24-bit pixel value representation. If the
+ image of interest was of an ocean, we would expect many 24-bit
+ blues to be represented by 8-bit codes. If it was an image of a
+ human face, more flesh tone colors would be represented in the
+ code book.
- This module provides routines for k-means clustering, generating
- code books from k-means, and quantizing vectors by comparing
- them to centroids in a code book.
+ All routines expect the observation vectors to be stored as rows
+ in the obs matrix. Similarly the centroids corresponding to the
+ codes are stored as rows of the code_book matrix. The i'th index
+ is the code corresponding to the code_book[i] centroid.
- All routines expect an "observation vector" to be stored in each
- row of the obs matrix. Similarly the centroids corresponding to
- the codes are stored as rows of the code_book matrix. The i'th
- index is the code corresponding to the code_book[i] centroid.

whiten(obs) --
Normalize a group of observations so each feature has unit variance.
vq(obs,code_book) --
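
The docstring in this diff describes three ideas: observations and centroids stored as rows, distortion as the sum of distances from each observation to its dominating centroid, and a stopping rule based on the change in distortion. The following is a minimal plain-Python sketch of those ideas, not the code in vq.py. The names nearest, quantize and kmeans_sketch, the thresh, max_iter and seed parameters, and the six sample pixels are all hypothetical and chosen for illustration only. It assumes Euclidean distance and a mean-based centroid update.

```python
import math
import random


def nearest(vector, code_book):
    """Return (code, distance) for the centroid closest to `vector`."""
    best_code, best_dist = 0, math.inf
    for code, centroid in enumerate(code_book):
        d = math.dist(vector, centroid)  # Euclidean distance (assumption)
        if d < best_dist:
            best_code, best_dist = code, d
    return best_code, best_dist


def quantize(obs, code_book):
    """Map each row of `obs` to the code of its dominating centroid.

    Returns the list of codes and the distortion, taken here (as in the
    docstring) to be the sum of distances to the chosen centroids.
    """
    pairs = [nearest(v, code_book) for v in obs]
    codes = [code for code, _ in pairs]
    distortion = sum(dist for _, dist in pairs)
    return codes, distortion


def kmeans_sketch(obs, k, thresh=1e-5, max_iter=100, seed=0):
    """Refine k centroids until distortion stops improving by more than thresh."""
    rng = random.Random(seed)
    # Initial code book: k distinct observations picked at random.
    code_book = [list(v) for v in rng.sample(obs, k)]
    prev = math.inf
    for _ in range(max_iter):
        codes, distortion = quantize(obs, code_book)
        if prev - distortion <= thresh:
            break  # no meaningful progress, so terminate
        prev = distortion
        for code in range(k):
            members = [v for v, c in zip(obs, codes) if c == code]
            if members:  # an empty cluster keeps its previous centroid
                code_book[code] = [sum(col) / len(members) for col in zip(*members)]
    return code_book, quantize(obs, code_book)[1]


# Hypothetical data: three bluish and three flesh-tone 24-bit pixels (R, G, B).
obs = [(0, 0, 250), (5, 0, 255), (0, 5, 245),
       (250, 200, 180), (255, 205, 185), (245, 195, 175)]

code_book, distortion = kmeans_sketch(obs, k=2)
codes, _ = quantize(obs, code_book)          # one small code per pixel
decoded = [code_book[c] for c in codes]      # translate codes back to colors
```

Here code_book[i] is the centroid for code i, matching the row convention the docstring states. With k=2 each pixel is sent as one of two codes and decoded to the corresponding centroid color; the 256-entry case in the docstring works the same way with k=256.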