Thu Jul 22 22:49:37 CDT 2010
On Thu, Jul 22, 2010 at 2:23 PM, Benjamin Root <firstname.lastname@example.org> wrote:
> On Thu, Jul 22, 2010 at 12:12 PM, alex <email@example.com> wrote:
> >>> For example:
> >>> >>> import numpy as np
> >>> >>> from scipy import cluster
> >>> >>> v = np.array([1,2,3,4,10])
> >>> >>> cluster.vq.kmeans(v, 1)
> >>> (array(), 2.3999999999999999)
> >>> >>> np.mean([abs(x-4) for x in v])
> >>> 2.3999999999999999
> >>> >>> np.mean([abs(x-3) for x in v])
> >>> 2.2000000000000002
> >>> The result of this kmeans call suggests that the center 4 is best with
> >>> distortion 2.4. In fact this is not the case because a center of 3
> >>> have distortion 2.2.
> >> I wonder if this is really a bug in the minimization code rather than an
> >> issue with the distortion measure itself.
> >> Ben Root
> > The bug is in the _kmeans function in vq.py where it uses avg_dist[-2] -
> > avg_dist[-1] <= thresh as a stopping condition. This condition
> > assumes that the distortion monotonically decreases. One consequence is
> > that when the distortion increases, avg_dist[-2] - avg_dist[-1] will be
> > negative, and the codebook and distortion associated with avg_dist[-1]
> > returned. This is where the 2.4 vs 2.2 error comes from.
> > I guess there could be a few ways to resolve the bug. One way could be
> > use the sum of squares of distances instead of the distortion; this would
> > guarantee that the error sequence monotonically decreases, and I suspect
> > that this is what the author had originally intended.
> > Another way to deal with the bug could be to report the second to last
> > codebook and distortion instead of the last codebook and distortion when
> > stopping condition is met. This would probably fix the bug in the 2.2
> > 2.4 example, but it is kind of a kludge; if the sequence does not
> > monotonically decrease, then does it really make sense to use a small
> > as a stopping condition?
> > Alex
> I have to search for some old code of mine, but if I remember correctly, it
> would seem that the latter solution is actually a better fix because it
> addresses the heart of the matter. A minimization algorithm that doesn't
> return the most minimum value that it knows is buggy. The stopping
> condition is another issue altogether.
> There was a nice C-implemented, MIT-licensed kmeans algorithm that I used
> several years ago that had a much smarter stopping condition and several
> other useful features. Let me see if I can find it and we can compare.
> Ben Root
Ok, I think I found it. If this is the same library that I remember, it
isn't MIT licensed, but rather it seems to be Python licensed and already in
use for Pycluster.
I'll look through the code in the morning to see what the stopping condition
is, but I believe that it keeps iterating until it can no longer make any
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the SciPy-User