[SciPy-User] kmeans

Keith Goodman kwgoodman@gmail....
Sun Jul 25 16:53:53 CDT 2010


On Sun, Jul 25, 2010 at 12:41 PM, David Cournapeau <cournape@gmail.com> wrote:
> On Sun, Jul 25, 2010 at 2:36 AM, Keith Goodman <kwgoodman@gmail.com> wrote:
>> _kmeans chokes on large thresholds:
>>
>>>> from scipy import cluster
>>>> v = np.array([1,2,3,4,10], dtype=float)
>>>> cluster.vq.kmeans(v, 1, thresh=1e15)
>>   (array([ 4.]), 2.3999999999999999)
>>>> cluster.vq.kmeans(v, 1, thresh=1e16)
>> <snip>
>> IndexError: list index out of range
>>
>> The problem is in these lines:
>>
>>    diff = thresh+1.
>>    while diff > thresh:
>>        <snip>
>>        if(diff > thresh):
>>
>> If thresh is large then (thresh + 1) > thresh is False:
>>
>>>> thresh = 1e16
>>>> diff = thresh + 1.0
>>>> diff > thresh
>>   False
>>
>> What's a use case for a large threshold? You might want to study the
>> algorithm by seeing the result after one iteration (not to be confused
>> with the iter input which is something else).
>>
>> One fix is to use 2*thresh instead of thresh + 1. But that just
>> pushes the problem out to higher thresholds.
>
> Or just use the spacing function, which by definition returns the
> smallest number M such that thresh + M > thresh (except for nan/inf)

Neat, I've never heard of np.spacing. But it suffers the same fate:

Works:

>> thresh = 1e16
>> diff = thresh + np.spacing(thresh)
>> diff > thresh
   True

Doesn't work:

>> thresh = 1e400
>> diff = thresh + np.spacing(thresh)
>> diff > thresh
   False
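
A quick check of why the second case fails, as far as I can tell: the literal 1e400 is beyond the double range, so Python parses it as inf, and the spacing of inf is nan. That is the nan/inf exception mentioned above. A minimal sketch (the variable names are only for illustration):

```python
import numpy as np

# At 1e16 adjacent doubles are 2.0 apart, so adding 1.0 rounds back to 1e16.
gap = np.spacing(1e16)
plus_one_is_lost = (1e16 + 1.0) == 1e16

# 1e400 does not fit in a double; the literal evaluates to inf.
thresh = 1e400
thresh_is_inf = np.isinf(thresh)

# Silence the possible "invalid value" warning from nan arithmetic.
with np.errstate(invalid="ignore"):
    step = np.spacing(thresh)        # spacing of inf is nan
    diff = thresh + step             # inf + nan is nan
    still_greater = diff > thresh    # any comparison with nan is False

# A large but finite threshold is still fine with np.spacing.
finite = 1e300
finite_ok = finite + np.spacing(finite) > finite

print(gap, plus_one_is_lost, thresh_is_inf, np.isnan(step), still_greater, finite_ok)
```

So np.spacing covers every finite threshold; only thresh=inf (or nan) still falls through.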

len(avg_dist) == 0 could be used to mark the first time through the loop.
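
A rough sketch of that idea, not the actual _kmeans code: step is a hypothetical callable that does one update pass and returns the current distortion, and max_iter is a guard I added for the sketch.

```python
def iterate(step, thresh, max_iter=100):
    """Run step() until the drop in distortion is no longer > thresh.

    step and max_iter are assumptions made for this sketch only.
    """
    avg_dist = []
    for _ in range(max_iter):
        first_pass = len(avg_dist) == 0
        avg_dist.append(step())
        if first_pass:
            continue  # nothing to compare against yet
        diff = avg_dist[-2] - avg_dist[-1]
        if not diff > thresh:
            break
    return avg_dist


# Made-up distortion values standing in for real k-means passes.
values = [5.0, 3.0, 2.5, 2.4, 2.4]

it = iter(values)
huge = iterate(lambda: next(it), thresh=1e16)

it = iter(values)
infinite = iterate(lambda: next(it), thresh=float("inf"))

it = iter(values)
small = iterate(lambda: next(it), thresh=0.2)

print(huge, infinite, small)
```

Since the first pass no longer depends on thresh + 1 being greater than thresh, any threshold, including inf, gets through the loop at least once and avg_dist is never empty on return.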

Another minor issue:

The kmeans docstring says iteration stops when the change in
distortion is less than threshold. But as coded (if diff > thresh)
iteration also stops when the change is equal to the threshold.

Could either fix the code or the docstring. Fixing the code (if diff
>= thresh) means that thresh=0 could enter an infinite loop (negative
thresh already enters an infinite loop). So fixing the docstring seems
better.

To avoid infinite loops, I think iteration should terminate when there
is no change in distortion. But then, since there would be two
termination reasons, you'd probably want to output the reason
iteration terminated.
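
Something along these lines, as a sketch only: step is again a hypothetical callable returning the distortion after one pass, and the reason strings and max_iter guard are made up for illustration.

```python
def iterate_with_reason(step, thresh, max_iter=100):
    """Return (final distortion, reason iteration stopped).

    step, max_iter and the reason strings are assumptions for this sketch.
    """
    avg_dist = [step()]
    for _ in range(max_iter - 1):
        avg_dist.append(step())
        diff = avg_dist[-2] - avg_dist[-1]
        if diff == 0:
            # Catches thresh <= 0, where diff > thresh could stay True forever.
            return avg_dist[-1], "no change"
        if not diff > thresh:
            return avg_dist[-1], "threshold"
    return avg_dist[-1], "max_iter"


# Made-up distortion values: converges and then stops changing.
it = iter([5.0, 3.0, 3.0, 3.0])
negative = iterate_with_reason(lambda: next(it), thresh=-1.0)

it = iter([5.0, 3.0, 2.8, 2.8])
positive = iterate_with_reason(lambda: next(it), thresh=0.5)

print(negative, positive)
```

With a negative threshold the "no change" test is what ends the loop; with an ordinary threshold the usual test fires first.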

