[SciPy-User] kmeans
Benjamin Root
ben.root@ou....
Fri Jul 23 12:19:39 CDT 2010
On Thu, Jul 22, 2010 at 10:49 PM, Benjamin Root <ben.root@ou.edu> wrote:
> On Thu, Jul 22, 2010 at 2:23 PM, Benjamin Root <ben.root@ou.edu> wrote:
>
>> On Thu, Jul 22, 2010 at 12:12 PM, alex <argriffi@ncsu.edu> wrote:
>> >>
>> >>> For example:
>> >>>
>> >>> >>> import numpy as np
>> >>> >>> from scipy import cluster
>> >>> >>> v = np.array([1,2,3,4,10])
>> >>> >>> cluster.vq.kmeans(v, 1)
>> >>> (array([4]), 2.3999999999999999)
>> >>> >>> np.mean([abs(x-4) for x in v])
>> >>> 2.3999999999999999
>> >>> >>> np.mean([abs(x-3) for x in v])
>> >>> 2.2000000000000002
>> >>>
>> >>> The result of this kmeans call suggests that the center 4 is best with
>> >>> distortion 2.4. In fact this is not the case because a center of 3
>> would
>> >>> have distortion 2.2.
>> >>>
>> >>
>> >> I wonder if this is really a bug in the minimization code rather than
>> an
>> >> issue with the distortion measure itself.
>> >>
>> >> Ben Root
>> >
>> > The bug is in the _kmeans function in vq.py where it uses avg_dist[-2] -
>> > avg_dist[-1] <= thresh as a stopping condition. This condition
>> mistakenly
>> > assumes that the distortion monotonically decreases. One consequence is
>> > that when the distortion increases, avg_dist[-2] - avg_dist[-1] will be
>> > negative, and the codebook and distortion associated with avg_dist[-1]
>> are
>> > returned. This is where the 2.4 vs 2.2 error comes from.
>> >
>> > I guess there could be a few ways to resolve the bug. One way could be
>> to
>> > use the sum of squares of distances instead of the distortion; this
>> would
>> > guarantee that the error sequence monotonically decreases, and I suspect
>> > that this is what the author had originally intended.
>> >
>> > Another way to deal with the bug could be to report the second to last
>> > codebook and distortion instead of the last codebook and distortion when
>> the
>> > stopping condition is met. This would probably fix the bug in the 2.2
>> vs.
>> > 2.4 example, but it is kind of a kludge; if the sequence does not
>> > monotonically decrease, then does it really make sense to use a small
>> change
>> > as a stopping condition?
>> >
>> > Alex
>> >
>>
>> I have to search for some old code of mine, but if I remember correctly,
>> it would seem that the latter solution is actually a better fix because it
>> addresses the heart of the matter. A minimization algorithm that doesn't
>> return the most minimum value that it knows is buggy. The stopping
>> condition is another issue altogether.
>>
>> There was a nice C-implemented, MIT-licensed kmeans algorithm that I used
>> several years ago that had a much smarter stopping condition and several
>> other useful features. Let me see if I can find it and we can compare.
>>
>> Ben Root
>>
>>
>>
> Ok, I think I found it. If this is the same library that I remember, it
> isn't MIT licensed, but rather it seems to be Python licensed and already in
> use for Pycluster.
>
> http://bonsai.hgc.jp/~mdehoon/software/cluster/index.html<http://bonsai.hgc.jp/%7Emdehoon/software/cluster/index.html>
>
> I'll look through the code in the morning to see what the stopping
> condition is, but I believe that it keeps iterating until it can no longer
> make any reassignments.
>
> Ben Root
>
Examining further, I see that SciPy's implementation is fairly simplistic
and has some issues. In the given example, the reason why 3 is never
returned is not because of the use of the distortion metric, but rather
because the kmeans function never sees the distance for using 3. As a
matter of fact, the actual code that does the convergence is in vq and py_vq
(vector quantization) and it tries to minimize the sum of squared errors.
kmeans just keeps on retrying the convergence with random guesses to see if
different convergences occur.
Is Pycluster even maintained anymore? Maybe we should look into integrating
it into SciPy if it isn't being maintained.
Ben Root
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.scipy.org/pipermail/scipy-user/attachments/20100723/2c9bfef9/attachment.html
More information about the SciPy-User
mailing list