# [Scipy-svn] r4670 - trunk/scipy/cluster

scipy-svn@scip... scipy-svn@scip...
Sat Aug 23 14:04:55 CDT 2008

Author: damian.eads
Date: 2008-08-23 14:04:52 -0500 (Sat, 23 Aug 2008)
New Revision: 4670

Modified:
trunk/scipy/cluster/hierarchy.py
Log:
Finished RSTifying scipy.cluster.hierarchy.linkage function.

Modified: trunk/scipy/cluster/hierarchy.py
===================================================================
--- trunk/scipy/cluster/hierarchy.py	2008-08-23 18:44:21 UTC (rev 4669)
+++ trunk/scipy/cluster/hierarchy.py	2008-08-23 19:04:52 UTC (rev 4670)
@@ -427,114 +427,153 @@

def linkage(y, method='single', metric='euclidean'):
-    """ Z = linkage(y, method)
+    """
+    Performs hierarchical/agglomerative clustering on the
+    condensed distance matrix y. y must be a {n \choose 2} sized
+    vector where n is the number of original observations paired
+    in the distance matrix. The behavior of this function is very
+    similar to the MATLAB(TM) linkage function.

-          Performs hierarchical/agglomerative clustering on the
-          condensed distance matrix y. y must be a {n \choose 2} sized
-          vector where n is the number of original observations paired
-          in the distance matrix. The behavior of this function is very
-          similar to the MATLAB(TM) linkage function.
+    A 4 by :math:$(n-1)$ matrix Z is returned. At the
+    :math:$i$th iteration, clusters with indices Z[i, 0] and
+    Z[i, 1] are combined to form cluster :math:$n + i$. A
+    cluster with an index less than :math:$n$ corresponds to one of
+    the :math:$n$ original observations. The distance between
+    clusters Z[i, 0] and Z[i, 1] is given by Z[i, 2]. The
+    fourth value Z[i, 3] represents the number of original
+    observations in the newly formed cluster.

-          A (n - 1) * 4 matrix Z is returned. At the i'th iteration,
-          clusters with indices Z[i, 0] and Z[i, 1] are combined to
-          form cluster n + i. A cluster with an index less than n
-          corresponds to one of the n original observations. The
-          distance between clusters Z[i, 0] and Z[i, 1] is given by
-          Z[i, 2]. The fourth value Z[i, 3] represents the number of
-          original observations in the newly formed cluster.
+    The following linkage methods are used to compute the distance
+    :math:$d(s, t)$ between two clusters :math:$s$ and
+    :math:$t$. The algorithm begins with a forest of clusters that
+    have yet to be used in the hierarchy being formed. When two
+    clusters :math:$s$ and :math:$t$ from this forest are combined
+    into a single cluster :math:$u$, :math:$s$ and :math:$t$ are
+    removed from the forest, and :math:$u$ is added to the
+    forest. When only one cluster remains in the forest, the algorithm
+    stops, and this cluster becomes the root.

-          The following linkage methods are used to compute the
-          distance dist(s, t) between two clusters s and t. The
-          algorithm begins with a forest of clusters that have yet
-          to be used in the hierarchy being formed. When two clusters
-          s and t from this forest are combined into a single cluster u,
-          s and t are removed from the forest, and u is added to
-          the forest. When only one cluster remains in the forest,
-          the algorithm stops, and this cluster becomes the root.
+    A distance matrix is maintained at each iteration. The d[i,j]
+    entry corresponds to the distance between cluster :math:$i$ and
+    :math:$j$ in the original forest.

-          A distance matrix is maintained at each iteration. The
-          d[i,j] entry corresponds to the distance between cluster
-          i and j in the original forest.
+    At each iteration, the algorithm must update the distance matrix
+    to reflect the distance of the newly formed cluster u with the
+    remaining clusters in the forest.

-          At each iteration, the algorithm must update the distance
-          matrix to reflect the distance of the newly formed cluster
-          u with the remaining clusters in the forest.
+    Suppose there are :math:$|u|$ original observations
+    :math:$u[0], \ldots, u[|u|-1]$ in cluster :math:$u$ and
+    :math:$|v|$ original objects :math:$v[0], \ldots, v[|v|-1]$ in
+    cluster :math:$v$. Recall :math:$s$ and :math:$t$ are
+    combined to form cluster :math:$u$. Let :math:$v$ be any
+    remaining cluster in the forest that is not :math:$u$.

-          Suppose there are |u| original observations u[0], ..., u[|u|-1]
-          in cluster u and |v| original objects v[0], ..., v[|v|-1]
-          in cluster v. Recall s and t are combined to form cluster
-          u. Let v be any remaining cluster in the forest that is not
-          u.
+    :Parameters:
+       Q : ndarray
+           A condensed or redundant distance matrix. A condensed
+           distance matrix is a flat array containing the upper
+           triangular of the distance matrix. This is the form that
+           pdist returns. Alternatively, a collection of
+           :math:$m$ observation vectors in n dimensions may be passed as
+           a :math:$m$ by :math:$n$ array.
+       method : string
+           The linkage algorithm to use. See the Linkage Methods
+           section below for full descriptions.
+       metric : string
+           The distance metric to use. See the Distance Metric
+           section below for full descriptions.

-          The following are methods for calculating the distance between
-          the newly formed cluster u and each v.
+    ---------------

-            * method='single' assigns dist(u,v) = MIN(dist(u[i],v[j])
-              for all points i in cluster u and j in cluster v.
+    The following are methods for calculating the distance between the
+    newly formed cluster :math:$u$ and each :math:$v$.

-                (also called Nearest Point Algorithm)
+    * method=single assigns

-            * method='complete' assigns dist(u,v) = MAX(dist(u[i],v[j])
-              for all points i in cluster u and j in cluster v.
+      .. math:
+         d(u,v) = \min(dist(u[i],v[j]))

-                (also called Farthest Point Algorithm
-                      or the Voor Hees Algorithm)
+      for all points :math:$i$ in cluster :math:$u$ and
+      :math:$j$ in cluster :math:$v$. This is also known as the
+      Nearest Point Algorithm.

-           * method='average' assigns dist(u,v) =
-                \sum_{ij} { dist(u[i], v[j]) } / (|u|*|v|)
-             for all points i and j where |u| and |v| are the
-             cardinalities of clusters u and v, respectively.
+    * method=complete assigns

-                (also called UPGMA)
+      .. math:
+         d(u, v) = \max(dist(u[i],v[j]))

-           * method='weighted' assigns
+      for all points :math:$i$ in cluster u and :math:$j$ in
+      cluster :math:$v$. This is also known by the Farthest Point
+      Algorithm or Voor Hees Algorithm.

-               dist(u,v) = (dist(s,v) + dist(t,v))/2
+    * method=average assigns

-             where cluster u was formed with cluster s and t and v
-             is a remaining cluster in the forest. (also called WPGMA)
+      .. math:
+         d(u,v) = \sum_{ij} \frac{d(u[i], v[j])}
+                                 {(|u|*|v|)

-        Z = linkage(X, method, metric='euclidean')
+      for all points :math:$i$ and :math:$j$ where :math:$|u|$
+      and :math:$|v|$ are the cardinalities of clusters :math:$u$
+      and :math:$v$, respectively. This is also called the UPGMA
+      algorithm. This is called UPGMA.

-         Performs hierarchical clustering on the objects defined by the
-         n by m observation matrix X.
+    * method='weighted' assigns

-         If the metric is 'euclidean' then the following methods may be
-         used:
+      .. math:
+         d(u,v) = (dist(s,v) + dist(t,v))/2

-           * method='centroid' assigns dist(s,t) = euclid(c_s, c_t) where
-             c_s and c_t are the centroids of clusters s and t,
-             respectively. When two clusters s and t are combined into a new
-             cluster u, the new centroid is computed over all the original
-             objects in clusters s and t. The distance then becomes
-             the Euclidean distance between the centroid of u and the
-             centroid of a remaining cluster v in the forest.
-             (also called UPGMC)
+      where cluster u was formed with cluster s and t and v
+      is a remaining cluster in the forest. (also called WPGMA)

-           * method='median' assigns dist(s,t) as above. When two clusters
-             s and t are combined into a new cluster u, the average of
-             centroids s and t give the new centroid u. (also called WPGMC)

-           * method='ward' uses the Ward variance minimization algorithm.
-             The new entry dist(u, v) is computed as follows,
+    * method='centroid' assigns

-                 dist(u,v) =
+      .. math:
+         dist(s,t) = euclid(c_s, c_t)

-                ----------------------------------------------------
-                | |v|+|s|            |v|+|t|            |v|
-                | ------- d(v,s)^2 + ------- d(v,t)^2 - --- d(s,t)^2
-               \|    T                  T                T
+      where :math:$c_s$ and :math:$c_t$ are the centroids of
+      clusters :math:$s$ and :math:$t$, respectively. When two
+      clusters :math:$s$ and :math:$t$ are combined into a new
+      cluster :math:$u$, the new centroid is computed over all the
+      original objects in clusters :math:$s$ and :math:$t$. The
+      distance then becomes the Euclidean distance between the
+      centroid of :math:$u$ and the centroid of a remaining cluster
+      :math:$v$ in the forest. This is also known as the UPGMC
+      algorithm.

-             where u is the newly joined cluster consisting of clusters
-             s and t, v is an unused cluster in the forest, T=|v|+|s|+|t|,
-             and |*| is the cardinality of its argument.
-             (also called incremental)
+    * method='median' assigns math:$d(s,t)$ like the centroid
+      method. When two clusters s and t are combined into a new
+      cluster :math:$u$, the average of centroids s and t give the
+      new centroid :math:$u$. This is also known as the WPGMC
+      algorithm.

-           Warning to MATLAB(TM) users: when the minimum distance pair in
-           the forest is chosen, there may be two or more pairs with the
-           same minimum distance. This implementation may chose a
-           different minimum than the MATLAB(TM) version.
-        """
+    * method='ward' uses the Ward variance minimization algorithm.
+      The new entry :math:$d(u,v)$ is computed as follows,
+
+      .. math:
+
+         d(u,v) = \sqrt{\frac{|v|+|s|}
+                             {T}d(v,s)^2
+                      + \frac{|v|+|t|}
+                             {T}d(v,t)^2
+                      + \frac{|v|}
+                             {T}d(s,t)^2}
+
+      where :math:$u$ is the newly joined cluster consisting of
+      clusters :math:$s$ and :math:$t$, :math:$v$ is an unused
+      cluster in the forest, :math:$T=|v|+|s|+|t|$, and
+      :math:$|*|$ is the cardinality of its argument. This is also
+      known as the incremental algorithm.
+
+   Warning
+   -------
+
+   When the minimum distance pair in the forest is chosen, there may
+   be two or more pairs with the same minimum distance. This
+   implementation may chose a different minimum than the MATLAB(TM)
+   version.
+   """
if not isinstance(method, str):
raise TypeError("Argument 'method' must be a string.")