Wed Jul 21 15:25:13 CDT 2010
I want to nitpick about the scipy kmeans clustering implementation.
Throughout the documentation
(http://docs.scipy.org/doc/scipy/reference/cluster.vq.html) and the code, the
'distortion' of a clustering is defined as "the sum of the distances between
each observation vector and its dominating centroid." I think the sum of
squared distances should be used instead of the sum of distances, and all of
the kmeans descriptions I found with Google seem to support this.
For example, if one cluster contains the 1D points (1, 2, 3, 4, 10) and the
old center is 3, then the centroid updating step will move the centroid to
the mean of the points, which is 4. This step reduces the sum of squares of
distances from 55 to 50, but it increases the distortion (the sum of
distances) from 11 to 12.
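
The arithmetic in this example can be checked with a few lines of plain
Python. This is only a minimal sketch: the helper names sum_of_distances and
sum_of_squares are made up for illustration and are not part of
scipy.cluster.vq.

```python
# The 1D cluster and the old center from the example above.
points = [1, 2, 3, 4, 10]
old_center = 3


def sum_of_distances(pts, center):
    """Sum of absolute distances (the 'distortion' as the docs define it)."""
    return sum(abs(p - center) for p in pts)


def sum_of_squares(pts, center):
    """Sum of squared distances (the usual kmeans objective)."""
    return sum((p - center) ** 2 for p in pts)


# The centroid update step moves the center to the mean of the cluster.
new_center = sum(points) / len(points)

print("center:", old_center, "->", new_center)
print("sum of distances:",
      sum_of_distances(points, old_center), "->",
      sum_of_distances(points, new_center))
print("sum of squares:",
      sum_of_squares(points, old_center), "->",
      sum_of_squares(points, new_center))
```

The mean minimizes the sum of squared distances, while the sum of plain
distances in 1D is minimized by the median (3 here), which is why the update
step can make the documented distortion go up.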