[SciPy-User] kmeans

alex argriffi@ncsu....
Thu Jul 22 10:42:36 CDT 2010

On Thu, Jul 22, 2010 at 10:48 AM, Benjamin Root <ben.root@ou.edu> wrote:

> On Wed, Jul 21, 2010 at 3:25 PM, alex <argriffi@ncsu.edu> wrote:
>> Hi,
>> I want to nitpick about the scipy kmeans clustering implementation.
>> Throughout the documentation
>> http://docs.scipy.org/doc/scipy/reference/cluster.vq.html and code, the
>> 'distortion' of a clustering is defined as "the sum of the distances between
>> each observation vector and its dominating centroid."  I think that the sum
>> of squares of distances should be used instead of the sum of distances, and
>> all of the miscellaneous kmeans descriptions I found with google would seem
>> to support this.
>> For example if one cluster contains the 1D points (1, 2, 3, 4, 10) and the
>> old center is 3, then the centroid updating step will move the centroid to
>> 4.  This step reduces the sum of squares of distances from 55 to 50, but it
>> increases the distortion from 11 to 12.
>> Alex
> Every implementation of kmeans (except for SciPy's) that I have seen
> allowed for the user to specify which distance measure they want to use.
> There is no right answer for a distance measure except for "it depends".
> Maybe SciPy's implementation should be updated to allow for user-specified
> distance measures (e.g. - absolute, euclidian, city-block, etc.)?
> Ben Root

While the best distance might depend on the application, I think that of all
these distance measures only the sum of squares of Euclidean distances is
guaranteed to monotonically decrease at each step of the algorithm.  If the
scipy kmeans implementation depends on this monotonicity, which I think it
currently does, then this assumption could be the source of subtle bugs when
error measures like distortion are used.

Maybe I can nitpick more effectively if I frame it as "scipy is doing
something strange and possibly buggy" instead of as "scipy is not using the
distance function I want".

For example:

>>> import numpy as np
>>> from scipy import cluster
>>> v = np.array([1,2,3,4,10])
>>> cluster.vq.kmeans(v, 1)
(array([4]), 2.3999999999999999)
>>> np.mean([abs(x-4) for x in v])
>>> np.mean([abs(x-3) for x in v])

The result of this kmeans call suggests that the center 4 is best with
distortion 2.4.  In fact this is not the case because a center of 3 would
have distortion 2.2.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.scipy.org/pipermail/scipy-user/attachments/20100722/353a3ab6/attachment.html 

More information about the SciPy-User mailing list