Thu Jul 22 14:23:47 CDT 2010
On Thu, Jul 22, 2010 at 12:12 PM, alex <email@example.com> wrote:
>>> For example:
>>> >>> import numpy as np
>>> >>> from scipy import cluster
>>> >>> v = np.array([1,2,3,4,10])
>>> >>> cluster.vq.kmeans(v, 1)
>>> (array(), 2.3999999999999999)
>>> >>> np.mean([abs(x-4) for x in v])
>>> >>> np.mean([abs(x-3) for x in v])
>>> The result of this kmeans call suggests that the center 4 is best with
>>> distortion 2.4. In fact this is not the case because a center of 3
>>> have distortion 2.2.
>> I wonder if this is really a bug in the minimization code rather than an
>> issue with the distortion measure itself.
>> Ben Root
> The bug is in the _kmeans function in vq.py where it uses avg_dist[-2] -
> avg_dist[-1] <= thresh as a stopping condition. This condition mistakenly
> assumes that the distortion monotonically decreases. One consequence is
> that when the distortion increases, avg_dist[-2] - avg_dist[-1] will be
> negative, and the codebook and distortion associated with avg_dist[-1] are
> returned. This is where the 2.4 vs 2.2 error comes from.
> I guess there could be a few ways to resolve the bug. One way could be to
> use the sum of squares of distances instead of the distortion; this would
> guarantee that the error sequence monotonically decreases, and I suspect
> that this is what the author had originally intended.
> Another way to deal with the bug could be to report the second to last
> codebook and distortion instead of the last codebook and distortion when
> stopping condition is met. This would probably fix the bug in the 2.2 vs.
> 2.4 example, but it is kind of a kludge; if the sequence does not
> monotonically decrease, then does it really make sense to use a small
> as a stopping condition?
I have to search for some old code of mine, but if I remember correctly, it
would seem that the latter solution is actually a better fix because it
addresses the heart of the matter. A minimization algorithm that doesn't
return the most minimum value that it knows is buggy. The stopping
condition is another issue altogether.
There was a nice C-implemented, MIT-licensed kmeans algorithm that I used
several years ago that had a much smarter stopping condition and several
other useful features. Let me see if I can find it and we can compare.
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the SciPy-User