Thu Jul 22 09:48:04 CDT 2010
On Wed, Jul 21, 2010 at 3:25 PM, alex <email@example.com> wrote:
> I want to nitpick about the scipy kmeans clustering implementation.
> Throughout the documentation
> http://docs.scipy.org/doc/scipy/reference/cluster.vq.html and code, the
> 'distortion' of a clustering is defined as "the sum of the distances between
> each observation vector and its dominating centroid." I think that the sum
> of squares of distances should be used instead of the sum of distances, and
> all of the miscellaneous kmeans descriptions I found with Google would seem
> to support this.
> For example, if one cluster contains the 1D points (1, 2, 3, 4, 10) and the
> old center is 3, then the centroid updating step will move the centroid to
> 4. This step reduces the sum of squares of distances from 55 to 50, but it
> increases the distortion from 11 to 12.
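
The arithmetic in the quoted example can be checked with a short sketch. The
helper names below are illustrative only and are not part of scipy.cluster.vq.

```python
# The 1D cluster from the quoted example.
points = [1, 2, 3, 4, 10]

def sum_of_distances(points, center):
    """Distortion as the scipy docs describe it: sum of absolute distances."""
    return sum(abs(p - center) for p in points)

def sum_of_squares(points, center):
    """The usual k-means objective: sum of squared distances."""
    return sum((p - center) ** 2 for p in points)

old_center = 3
# The centroid update step moves the center to the arithmetic mean.
new_center = sum(points) / len(points)

for center in (old_center, new_center):
    print(center, sum_of_distances(points, center), sum_of_squares(points, center))
```

The mean minimizes the sum of squares, not the sum of distances, which is why
the update step can increase the distortion as documented.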
Every implementation of kmeans that I have seen, except for SciPy's, lets
the user specify which distance measure to use. There is no right answer
for a distance measure other than "it depends". Maybe SciPy's
implementation should be updated to allow user-specified distance measures
(e.g. absolute, Euclidean, city-block)?
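
As a rough sketch of what a user-specified distance measure could look like,
here is an assignment step that takes the metric as an argument. This is not
SciPy's API; the names assign, distortion, euclidean and cityblock are
hypothetical, and the code is plain Python for illustration only.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cityblock(a, b):
    """City-block (sum of absolute differences) distance."""
    return sum(abs(x - y) for x, y in zip(a, b))

def assign(observations, centroids, metric=euclidean):
    """For each observation, return the index of the nearest centroid
    under the given metric, together with that distance."""
    labels, dists = [], []
    for obs in observations:
        d = [metric(obs, c) for c in centroids]
        best = min(range(len(centroids)), key=d.__getitem__)
        labels.append(best)
        dists.append(d[best])
    return labels, dists

def distortion(dists, power=1):
    """Sum of distances (power=1) or of squared distances (power=2)."""
    return sum(d ** power for d in dists)

# The choice of metric can change which centroid "dominates" a point.
centroids = [(3.0, 3.0), (0.0, 5.0)]
labels_euclidean, _ = assign([(0.0, 0.0)], centroids, metric=euclidean)
labels_cityblock, _ = assign([(0.0, 0.0)], centroids, metric=cityblock)
```

One caveat with any such option: the mean is the minimizing center only for
squared Euclidean distance, so a different metric would also need a matching
update step (for city-block distance, the coordinate-wise median).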