[SciPy-User] kmeans
Benjamin Root
ben.root@ou....
Thu Jul 22 22:49:37 CDT 2010
On Thu, Jul 22, 2010 at 2:23 PM, Benjamin Root <ben.root@ou.edu> wrote:
> On Thu, Jul 22, 2010 at 12:12 PM, alex <argriffi@ncsu.edu> wrote:
> >>
> >>> For example:
> >>>
> >>> >>> import numpy as np
> >>> >>> from scipy import cluster
> >>> >>> v = np.array([1,2,3,4,10])
> >>> >>> cluster.vq.kmeans(v, 1)
> >>> (array([4]), 2.3999999999999999)
> >>> >>> np.mean([abs(x-4) for x in v])
> >>> 2.3999999999999999
> >>> >>> np.mean([abs(x-3) for x in v])
> >>> 2.2000000000000002
> >>>
> >>> The result of this kmeans call suggests that the center 4 is best with
> >>> distortion 2.4. In fact this is not the case because a center of 3
> would
> >>> have distortion 2.2.
> >>>
> >>
> >> I wonder if this is really a bug in the minimization code rather than an
> >> issue with the distortion measure itself.
> >>
> >> Ben Root
> >
> > The bug is in the _kmeans function in vq.py where it uses avg_dist[-2] -
> > avg_dist[-1] <= thresh as a stopping condition. This condition
> mistakenly
> > assumes that the distortion monotonically decreases. One consequence is
> > that when the distortion increases, avg_dist[-2] - avg_dist[-1] will be
> > negative, and the codebook and distortion associated with avg_dist[-1]
> are
> > returned. This is where the 2.4 vs 2.2 error comes from.
> >
> > I guess there could be a few ways to resolve the bug. One way could be
> to
> > use the sum of squares of distances instead of the distortion; this would
> > guarantee that the error sequence monotonically decreases, and I suspect
> > that this is what the author had originally intended.
> >
> > Another way to deal with the bug could be to report the second to last
> > codebook and distortion instead of the last codebook and distortion when
> the
> > stopping condition is met. This would probably fix the bug in the 2.2
> vs.
> > 2.4 example, but it is kind of a kludge; if the sequence does not
> > monotonically decrease, then does it really make sense to use a small
> change
> > as a stopping condition?
> >
> > Alex
> >
>
> I have to search for some old code of mine, but if I remember correctly, it
> would seem that the latter solution is actually a better fix because it
> addresses the heart of the matter. A minimization algorithm that doesn't
> return the most minimum value that it knows is buggy. The stopping
> condition is another issue altogether.
>
> There was a nice C-implemented, MIT-licensed kmeans algorithm that I used
> several years ago that had a much smarter stopping condition and several
> other useful features. Let me see if I can find it and we can compare.
>
> Ben Root
>
>
>
Ok, I think I found it. If this is the same library that I remember, it
isn't MIT licensed, but rather it seems to be Python licensed and already in
use for Pycluster.
http://bonsai.hgc.jp/~mdehoon/software/cluster/index.html
I'll look through the code in the morning to see what the stopping condition
is, but I believe that it keeps iterating until it can no longer make any
reassignments.
Ben Root
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.scipy.org/pipermail/scipy-user/attachments/20100722/d2ce56ae/attachment.html
More information about the SciPy-User
mailing list