[SciPy-User] kmeans

alex argriffi@ncsu....
Thu Jul 22 12:12:01 CDT 2010

> For example:
>> >>> import numpy as np
>> >>> from scipy import cluster
>> >>> v = np.array([1,2,3,4,10])
>> >>> cluster.vq.kmeans(v, 1)
>> (array([4]), 2.3999999999999999)
>> >>> np.mean([abs(x-4) for x in v])
>> 2.3999999999999999
>> >>> np.mean([abs(x-3) for x in v])
>> 2.2000000000000002
>> The result of this kmeans call suggests that the center 4 is best with
>> distortion 2.4.  In fact this is not the case because a center of 3 would
>> have distortion 2.2.
> I wonder if this is really a bug in the minimization code rather than an
> issue with the distortion measure itself.
> Ben Root

The bug is in the _kmeans function in vq.py, which uses
avg_dist[-2] - avg_dist[-1] <= thresh as a stopping condition.  This
condition mistakenly assumes that the distortion decreases monotonically.
One consequence is that when the distortion increases,
avg_dist[-2] - avg_dist[-1] is negative, so the stopping condition is met
immediately and the codebook and distortion associated with avg_dist[-1]
are returned, even though they are worse than the previous ones.  This is
where the 2.4 vs 2.2 error comes from.
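
To make the failure concrete, here is a minimal sketch of a loop with that
stopping rule, assuming a single center and 1-D data.  The function
one_center_kmeans and the starting guess of 3 are hypothetical and only
illustrate the stopping rule; this is not the actual _kmeans code from vq.py.

```python
import numpy as np

def one_center_kmeans(obs, guess, thresh=1e-5):
    """Simplified single-center loop (hypothetical, not the vq.py code)."""
    center = float(guess)
    avg_dist = []
    diff = thresh + 1.0
    while diff > thresh:
        # Distortion: mean absolute distance to the current center.
        avg_dist.append(np.mean(np.abs(obs - center)))
        # Update step: move the center to the mean of its members.
        center = float(np.mean(obs))
        if len(avg_dist) > 1:
            # Stopping rule: assumes the distortion never increases.
            diff = avg_dist[-2] - avg_dist[-1]
    return center, avg_dist

v = np.array([1, 2, 3, 4, 10], dtype=float)
center, history = one_center_kmeans(v, guess=3)
print(center, history)
```

Starting from 3, the distortion goes from about 2.2 up to about 2.4 after the
center moves to the mean, the difference is negative, and the loop stops with
the worse center.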

I guess there could be a few ways to resolve the bug.  One way could be to
use the sum of squares of distances instead of the distortion.  Moving each
center to the mean of its members minimizes the sum of squares for those
members, and reassigning points to their nearest center cannot increase it,
so this error sequence would never increase.  I suspect that this is what
the author had originally intended.
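
As a quick check on the example data, the two error measures can be compared
at the candidate centers 3 and 4.  The helper names mean_abs_dist and
sum_sq_dist are made up for this illustration.

```python
import numpy as np

v = np.array([1, 2, 3, 4, 10], dtype=float)

def mean_abs_dist(center):
    # The distortion from the example: mean absolute distance.
    return np.mean(np.abs(v - center))

def sum_sq_dist(center):
    # Alternative error measure: sum of squared distances.
    return np.sum((v - center) ** 2)

for c in (3.0, 4.0):
    print(c, mean_abs_dist(c), sum_sq_dist(c))
```

The mean (4) gives the smaller sum of squares, while the median (3) gives the
smaller mean absolute distance, which is why the mean update can make the
distortion worse.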

Another way to deal with the bug could be to report the second to last
codebook and distortion instead of the last codebook and distortion when the
stopping condition is met.  This would probably fix the bug in the 2.2 vs.
2.4 example, but it is kind of a kludge; if the sequence does not
monotonically decrease, then does it really make sense to use a small change
as a stopping condition?
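
For illustration only, the second-to-last idea might look like this in a
simplified single-center, 1-D setting.  The function one_center_kmeans_prev
is hypothetical and is not a proposed patch to vq.py.

```python
import numpy as np

def one_center_kmeans_prev(obs, guess, thresh=1e-5):
    """Return the second-to-last (center, distortion) pair (hypothetical)."""
    center = float(guess)
    history = []  # one (center, distortion) pair per iteration
    diff = thresh + 1.0
    while diff > thresh:
        distortion = np.mean(np.abs(obs - center))
        history.append((center, distortion))
        # Update step: move the center to the mean of its members.
        center = float(np.mean(obs))
        if len(history) > 1:
            diff = history[-2][1] - history[-1][1]
    # The loop always runs at least twice, so history[-2] exists.
    return history[-2]

v = np.array([1, 2, 3, 4, 10], dtype=float)
print(one_center_kmeans_prev(v, guess=3))
print(one_center_kmeans_prev(v, guess=10))
```

With a starting guess of 3 this reports the center 3 with distortion about
2.2, but with a starting guess of 10 it still ends at the center 4, so the
result depends on where the loop happens to start.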
