# [Scipy-svn] r4198 - trunk/scipy/cluster

scipy-svn@scip...
Tue Apr 29 18:14:01 CDT 2008

```
Author: damian.eads
Date: 2008-04-29 18:13:58 -0500 (Tue, 29 Apr 2008)
New Revision: 4198

Modified:
trunk/scipy/cluster/vq.py
Log:
More grammar and usage edits to vq.py documentation. Thanks to Karen Glocer for her help doing a pass.

Modified: trunk/scipy/cluster/vq.py
===================================================================
--- trunk/scipy/cluster/vq.py	2008-04-29 18:27:02 UTC (rev 4197)
+++ trunk/scipy/cluster/vq.py	2008-04-29 23:13:58 UTC (rev 4198)
@@ -5,20 +5,21 @@
centroids in a code book.

The k-means algorithm takes as input the number of clusters to
-    generate k and a set of observation vectors to cluster.  It
-    returns as its model a set of centroids, one for each of the k
-    clusters.  An observation vector is classified with the cluster
-    number or centroid index of the centroid closest to it.
+    generate, k, and a set of observation vectors to cluster.  It
+    returns a set of centroids, one for each of the k clusters.  An
+    observation vector is classified with the cluster number or
+    centroid index of the centroid closest to it.

A vector v belongs to cluster i if it is closer to centroid i than
-    the other centroids. If v belongs to i, we say centroid i is the
+    any other centroids. If v belongs to i, we say centroid i is the
dominating centroid of v. Common variants of k-means try to
minimize distortion, which is defined as the sum of the distances
between each observation vector and its dominating centroid.  Each
step of the k-means algorithm refines the choices of centroids to
reduce distortion. The change in distortion is often used as a
stopping criterion: when the change is lower than a threshold, the
-    k-means algorithm is not making sufficient progress and terminates.
+    k-means algorithm is not making sufficient progress and
+    terminates.

Since vector quantization is a natural application for k-means,
information theory terminology is often used.  The centroid index
@@ -31,7 +32,7 @@
For example, suppose we wish to compress a 24-bit color image
(each pixel is represented by one byte for red, one for blue, and
one for green) before sending it over the web.  By using a smaller
-    8-bit encoding, we can reduce the data to send by two
+    8-bit encoding, we can reduce the amount of data by two
thirds. Ideally, the colors for each of the 256 possible 8-bit
encoding values should be chosen to minimize distortion of the
color. Running k-means with k=256 generates a code book of 256
@@ -46,9 +47,9 @@
code book.

All routines expect obs to be a M by N array where the rows are
-    the observation vectors. The codebook is a k by N array where
-    the i'th row is the centroid of code word i. The observation
-    vectors and centroids have the same feature dimension.
+    the observation vectors. The codebook is a k by N array where the
+    i'th row is the centroid of code word i. The observation vectors
+    and centroids have the same feature dimension.

whiten(obs) --
Normalize a group of observations so each feature has unit
@@ -135,7 +136,7 @@
""" Vector Quantization: assign codes from a code book to observations.

Assigns a code from a code book to each observation. Each
-    observation vector in the MxN obs array is compared with the
+    observation vector in the M by N obs array is compared with the
centroids in the code book and assigned the code of the closest
centroid.

@@ -303,9 +304,10 @@
features (eg columns) than obs.

:Note:
-        This could be faster when number of codebooks is small, but it becomes
-        a real memory hog when codebook is large.  It requires NxMxO storage
-        where N=number of obs, M = number of features, and O = number of codes.
+        This could be faster when number of codebooks is small, but it
+        becomes a real memory hog when codebook is large. It requires
+        N by M by O storage where N=number of obs, M = number of
+        features, and O = number of codes.

:Returns:
code : ndarray
@@ -394,8 +396,8 @@
"""Performs k-means on a set of observation vectors forming k
clusters. This yields a code book mapping centroids to codes
and vice versa. The k-means algorithm adjusts the centroids
-       until the sufficient progress cannot be made, i.e. the change
-       in distortion since the last iteration is less than some
+       until sufficient progress cannot be made, i.e. the change in
+       distortion since the last iteration is less than some
threshold.

:Parameters:
@@ -406,14 +408,13 @@
function.

k_or_guess : int or ndarray
-            The number of centroids to generate. One code will be
-            assigned to each centroid, and it will be the row index in
-            the code_book matrix generated.
+            The number of centroids to generate. A code is assigned to
+            each centroid, which is also the row index of the centroid
+            in the code_book matrix generated.

The initial k centroids are chosen by randomly selecting
observations from the observation matrix. Alternatively,
-            passing a k by N array specifies the initial values of the
-            k centroids.
+            passing a k by N array specifies the initial k centroids.

iter : int
The number of times to run k-means, returning the codebook
@@ -432,7 +433,7 @@
A k by N array of k centroids. The i'th centroid
codebook[i] is represented with the code i. The centroids
and codes generated represent the lowest distortion seen,
-            not necessarily the global minimum distortion.
+            not necessarily the globally minimal distortion.

distortion : float
The distortion between the observations passed and the
@@ -441,7 +442,7 @@
:SeeAlso:
- kmeans2: a different implementation of k-means clustering
with more methods for generating initial centroids but without
-          using the distortion change threshold as a stopping criterion.
+          using a distortion change threshold as a stopping criterion.
- whiten: must be called prior to passing an observation matrix
to kmeans.

```
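
For readers unfamiliar with the module, the workflow the edited docstrings describe (whiten the observations, build a code book with kmeans, then assign codes with vq) can be sketched roughly as below. This is only an illustration written against a recent SciPy release, not code from this revision: the synthetic data, the random seed, and the choice of rows 0 and 50 as the initial guess are all assumptions made for the example. Note also that current SciPy reports distortion as a mean distance, which may differ from the wording in the docstring above.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

# Synthetic data (an assumption for this sketch): two well-separated
# blobs of 2-D observations, giving an M by N array with M=100, N=2.
rng = np.random.default_rng(0)
obs = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

# Scale each feature (column) to unit variance before clustering.
whitened = whiten(obs)

# Pass a k by N array as the initial centroids instead of an int k,
# so the run does not depend on random initialization.
# Rows 0 and 50 are an arbitrary pick of one observation per blob.
guess = whitened[[0, 50]]
code_book, distortion = kmeans(whitened, guess)

# Assign each observation the code (row index) of its closest centroid.
codes, dists = vq(whitened, code_book)

print(code_book.shape)  # k by N
print(codes[:5], codes[-5:])
```

Passing an integer for the second argument instead of `guess` asks kmeans to pick the initial centroids by randomly selecting observations, as the k_or_guess parameter description states.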
