[SciPy-User] kmeans

Sun Jul 25 16:53:53 CDT 2010

On Sun, Jul 25, 2010 at 12:41 PM, David Cournapeau <cournape@gmail.com> wrote:
> On Sun, Jul 25, 2010 at 2:36 AM, Keith Goodman <kwgoodman@gmail.com> wrote:
>> _kmeans chokes on large thresholds:
>>
>>>> import numpy as np
>>>> from scipy import cluster
>>>> v = np.array([1,2,3,4,10], dtype=float)
>>>> cluster.vq.kmeans(v, 1, thresh=1e15)
>>   (array([ 4.]), 2.3999999999999999)
>>>> cluster.vq.kmeans(v, 1, thresh=1e16)
>> <snip>
>> IndexError: list index out of range
>>
>> The problem is in these lines:
>>
>>    diff = thresh+1.
>>    while diff > thresh:
>>        <snip>
>>        if(diff > thresh):
>>
>> If thresh is large then (thresh + 1) > thresh is False:
>>
>>>> thresh = 1e16
>>>> diff = thresh + 1.0
>>>> diff > thresh
>>   False
>>
>> What's a use case for a large threshold? You might want to study the
>> algorithm by seeing the result after one iteration (not to be confused
>> with the iter input which is something else).
>>
>> One fix is to use 2*thresh instead of thresh + 1. But that just
>> pushes the problem out to higher thresholds.
>
> Or just use the spacing function, which by definition returns the
> smallest number M such that thresh + M > thresh (except for nan/inf)

Neat, I've never heard of np.spacing. But it suffers the same fate:

Works:

>> thresh = 1e16
>> diff = thresh + np.spacing(thresh)
>> diff > thresh
True
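
A quick sketch of why thresh + 1 fails here while the spacing works, assuming
IEEE 754 float64: the gap between adjacent doubles near 1e16 is 2.0, so adding
1.0 rounds back to thresh. The variable names below are only for illustration.

```python
import numpy as np

thresh = 1e16

# Gap between thresh and the next representable float64 above it.
gap = np.spacing(thresh)

# Adding 1.0 is half a gap, which rounds back down to thresh.
plus_one_is_lost = (thresh + 1.0) == thresh

# Adding the full gap lands on the next representable value.
plus_gap_is_greater = (thresh + gap) > thresh
```
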

Doesn't work:

>> thresh = 1e400
>> diff = thresh + np.spacing(thresh)
>> diff > thresh
False
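
One caveat on this second case, as far as I can tell: the literal 1e400 is too
large for a float64 and evaluates to inf, so this may be the nan/inf exception
mentioned above rather than a separate failure. A small check, assuming float64:

```python
import numpy as np

thresh = 1e400                      # beyond the float64 range (max is about 1.8e308)
is_inf = bool(np.isinf(thresh))

# The spacing of inf is nan; silence the floating-point warning NumPy may emit.
with np.errstate(invalid="ignore"):
    step = np.spacing(thresh)

diff = thresh + step                # inf + nan is nan
exceeds = bool(diff > thresh)       # any comparison with nan is False

# A large but finite threshold still behaves as expected.
finite = 1e300
finite_exceeds = bool(finite + np.spacing(finite) > finite)
```

So for any finite thresh the spacing approach appears to hold up; only an
infinite (or nan) thresh defeats it.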

Instead, len(avg_dist) == 0 could be used to mark the first time through
the loop, so that no starting value for diff is needed.
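
A minimal sketch of that idea, not the actual _kmeans code: step is a
hypothetical callable standing in for one k-means update that returns the
current average distortion, and run_until_converged is a made-up name.

```python
def run_until_converged(step, thresh):
    """Call step() until the drop in distortion is no longer > thresh.

    step is a hypothetical stand-in for one k-means update; it returns
    the current average distortion. Returns the list of distortions seen.
    """
    avg_dist = []
    while True:
        # An empty list marks the first pass, so no thresh + 1 sentinel is needed.
        first_pass = len(avg_dist) == 0
        avg_dist.append(step())
        if first_pass:
            continue  # nothing to compare against yet
        diff = avg_dist[-2] - avg_dist[-1]
        if not diff > thresh:
            return avg_dist


# Usage with made-up distortion values.
values = iter([10.0, 6.0, 5.0, 4.9])
huge = run_until_converged(lambda: next(values), thresh=1e16)

values = iter([10.0, 6.0, 5.0, 4.9])
small = run_until_converged(lambda: next(values), thresh=0.5)
```

With a huge threshold this stops as soon as there is one difference to test,
which is the "result after one iteration" use case, and it never evaluates
thresh + anything.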

Another minor issue:

The kmeans docstring says iteration stops when the change in
distortion is less than threshold. But as coded (if diff > thresh)
iteration also stops when the change is equal to the threshold.

Could fix either the code or the docstring. Fixing the code
(if diff >= thresh) means that thresh=0 could enter an infinite loop
(a negative thresh already enters an infinite loop). So fixing the
docstring seems better.

To avoid infinite loops, I think iteration should terminate when there
is no change in distortion. But then, since there would be two
termination reasons, you'd probably want to output the reason
iteration terminated.
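
Roughly what I have in mind, as a sketch only: step is again a hypothetical
callable returning the current average distortion, and the function name and
reason strings are invented for illustration, not part of scipy.

```python
def iterate_with_reason(step, thresh):
    """Iterate until the drop in distortion is <= thresh or exactly zero.

    Returns (last_distortion, reason), where reason is "threshold" or
    "no_change". step is a hypothetical stand-in for one k-means update.
    """
    prev = step()
    while True:
        cur = step()
        diff = prev - cur
        if diff <= thresh:
            return cur, "threshold"
        if diff == 0:
            # Only reachable when thresh is negative; guards against looping forever.
            return cur, "no_change"
        prev = cur


# Usage with made-up distortion values.
values = iter([10.0, 6.0, 5.0, 4.9])
by_threshold = iterate_with_reason(lambda: next(values), thresh=0.5)

values = iter([10.0, 6.0, 5.0, 5.0])
by_no_change = iterate_with_reason(lambda: next(values), thresh=-1.0)
```

The threshold test is checked first, so "no_change" is reported only in the
case that would otherwise loop forever.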
