[Scipy-svn] r4198 - trunk/scipy/cluster
Tue Apr 29 18:14:01 CDT 2008
Date: 2008-04-29 18:13:58 -0500 (Tue, 29 Apr 2008)
New Revision: 4198
More grammar and usage edits to vq.py documentation. Thanks to Karen Glocer for her help doing a pass.
--- trunk/scipy/cluster/vq.py 2008-04-29 18:27:02 UTC (rev 4197)
+++ trunk/scipy/cluster/vq.py 2008-04-29 23:13:58 UTC (rev 4198)
@@ -5,20 +5,21 @@
centroids in a code book.
The k-means algorithm takes as input the number of clusters to
- generate k and a set of observation vectors to cluster. It
- returns as its model a set of centroids, one for each of the k
- clusters. An observation vector is classified with the cluster
- number or centroid index of the centroid closest to it.
+ generate, k, and a set of observation vectors to cluster. It
+ returns a set of centroids, one for each of the k clusters. An
+ observation vector is classified with the cluster number or
+ centroid index of the centroid closest to it.
A vector v belongs to cluster i if it is closer to centroid i than
- the other centroids. If v belongs to i, we say centroid i is the
+ any other centroids. If v belongs to i, we say centroid i is the
dominating centroid of v. Common variants of k-means try to
minimize distortion, which is defined as the sum of the distances
between each observation vector and its dominating centroid. Each
step of the k-means algorithm refines the choices of centroids to
reduce distortion. The change in distortion is often used as a
stopping criterion: when the change is lower than a threshold, the
- k-means algorithm is not making sufficient progress and terminates.
+ k-means algorithm is not making sufficient progress and
+ terminates.
Since vector quantization is a natural application for k-means,
information theory terminology is often used. The centroid index
@@ -31,7 +32,7 @@
For example, suppose we wish to compress a 24-bit color image
(each pixel is represented by one byte for red, one for blue, and
one for green) before sending it over the web. By using a smaller
- 8-bit encoding, we can reduce the data to send by two
+ 8-bit encoding, we can reduce the amount of data by two
thirds. Ideally, the colors for each of the 256 possible 8-bit
encoding values should be chosen to minimize distortion of the
color. Running k-means with k=256 generates a code book of 256
@@ -46,9 +47,9 @@
All routines expect obs to be a M by N array where the rows are
- the observation vectors. The codebook is a k by N array where
- the i'th row is the centroid of code word i. The observation
- vectors and centroids have the same feature dimension.
+ the observation vectors. The codebook is a k by N array where the
+ i'th row is the centroid of code word i. The observation vectors
+ and centroids have the same feature dimension.
Normalize a group of observations so each feature has unit
@@ -135,7 +136,7 @@
""" Vector Quantization: assign codes from a code book to observations.
Assigns a code from a code book to each observation. Each
- observation vector in the MxN obs array is compared with the
+ observation vector in the M by N obs array is compared with the
centroids in the code book and assigned the code of the closest
@@ -303,9 +304,10 @@
features (eg columns) than obs.
- This could be faster when number of codebooks is small, but it becomes
- a real memory hog when codebook is large. It requires NxMxO storage
- where N=number of obs, M = number of features, and O = number of codes.
+ This could be faster when number of codebooks is small, but it
+ becomes a real memory hog when codebook is large. It requires
+ N by M by O storage where N=number of obs, M = number of
+ features, and O = number of codes.
code : ndarray
@@ -394,8 +396,8 @@
"""Performs k-means on a set of observation vectors forming k
clusters. This yields a code book mapping centroids to codes
and vice versa. The k-means algorithm adjusts the centroids
- until the sufficient progress cannot be made, i.e. the change
- in distortion since the last iteration is less than some
+ until sufficient progress cannot be made, i.e. the change in
+ distortion since the last iteration is less than some
@@ -406,14 +408,13 @@
k_or_guess : int or ndarray
- The number of centroids to generate. One code will be
- assigned to each centroid, and it will be the row index in
- the code_book matrix generated.
+ The number of centroids to generate. A code is assigned to
+ each centroid, which is also the row index of the centroid
+ in the code_book matrix generated.
The initial k centroids are chosen by randomly selecting
observations from the observation matrix. Alternatively,
- passing a k by N array specifies the initial values of the
- k centroids.
+ passing a k by N array specifies the initial k centroids.
iter : int
The number of times to run k-means, returning the codebook
@@ -432,7 +433,7 @@
A k by N array of k centroids. The i'th centroid
codebook[i] is represented with the code i. The centroids
and codes generated represent the lowest distortion seen,
- not necessarily the global minimum distortion.
+ not necessarily the globally minimal distortion.
distortion : float
The distortion between the observations passed and the
@@ -441,7 +442,7 @@
- kmeans2: a different implementation of k-means clustering
with more methods for generating initial centroids but without
- using the distortion change threshold as a stopping criterion.
+ using a distortion change threshold as a stopping criterion.
- whiten: must be called prior to passing an observation matrix
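
For readers of this archive, the workflow that the edited docstrings describe (whiten the observations, build a code book with kmeans, then assign codes with vq) might look roughly like the sketch below. It is not part of the commit. The synthetic two-blob data, the random seed, and the variable names are assumptions made for illustration; only whiten, kmeans, and vq come from scipy.cluster.vq.

```python
import numpy as np
from scipy.cluster.vq import whiten, kmeans, vq

# Hypothetical data: M = 100 observations with N = 2 features each,
# drawn from two well-separated blobs (illustration only).
rng = np.random.default_rng(0)
obs = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 2)),
    rng.normal(5.0, 0.5, size=(50, 2)),
])

# Scale each feature (column) to unit variance before clustering.
whitened = whiten(obs)

# Ask for k = 2 centroids. kmeans returns the code book (one row per
# centroid) and the distortion of the best run it saw.
codebook, distortion = kmeans(whitened, 2)

# Assign every observation the code (row index) of its closest
# centroid, along with the distance to that centroid.
codes, dists = vq(whitened, codebook)

print("code book shape:", codebook.shape)
print("distortion:", distortion)
print("first five codes:", codes[:5])
```

The centroids are initialised from randomly chosen observations, so the order of the rows in the code book (and hence which blob receives code 0) can vary between runs.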