[SciPy-User] kmeans

Keith Goodman kwgoodman@gmail....
Thu Jul 22 14:15:54 CDT 2010

On Thu, Jul 22, 2010 at 10:12 AM, alex <argriffi@ncsu.edu> wrote:
>>> For example:
>>> >>> import numpy as np
>>> >>> from scipy import cluster
>>> >>> v = np.array([1,2,3,4,10])
>>> >>> cluster.vq.kmeans(v, 1)
>>> (array([4]), 2.3999999999999999)
>>> >>> np.mean([abs(x-4) for x in v])
>>> 2.3999999999999999
>>> >>> np.mean([abs(x-3) for x in v])
>>> 2.2000000000000002
>>> The result of this kmeans call suggests that the center 4 is best with
>>> distortion 2.4.  In fact this is not the case because a center of 3 would
>>> have distortion 2.2.
>> I wonder if this is really a bug in the minimization code rather than an
>> issue with the distortion measure itself.
>> Ben Root
> The bug is in the _kmeans function in vq.py where it uses avg_dist[-2] -
> avg_dist[-1] <= thresh as a stopping condition.  This condition mistakenly
> assumes that the distortion monotonically decreases.  One consequence is
> that when the distortion increases, avg_dist[-2] - avg_dist[-1] will be
> negative, and the codebook and distortion associated with avg_dist[-1] are
> returned.  This is where the 2.4 vs 2.2 error comes from.
> I guess there could be a few ways to resolve the bug.  One way could be to
> use the sum of squares of distances instead of the distortion; this would
> guarantee that the error sequence monotonically decreases, and I suspect
> that this is what the author had originally intended.

You'd like to minimize the squared error (I don't know much about it
but that makes sense to me). But in the example you chose, the squared
error is minimized since the mean is 4. Was that just a coincidence? I
guess in the end the code is protected against any claims of bugs
since it doesn't guarantee to find the global minimum :)

> Another way to deal with the bug could be to report the second to last
> codebook and distortion instead of the last codebook and distortion when the
> stopping condition is met.  This would probably fix the bug in the 2.2 vs.
> 2.4 example, but it is kind of a kludge; if the sequence does not
> monotonically decrease, then does it really make sense to use a small change
> as a stopping condition?
> Alex
> _______________________________________________
> SciPy-User mailing list
> SciPy-User@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user

More information about the SciPy-User mailing list