[SciPy-User] kmeans

Keith Goodman kwgoodman@gmail....
Sat Jul 24 12:36:17 CDT 2010


_kmeans chokes on large thresholds:

>> import numpy as np
>> from scipy import cluster
>> v = np.array([1,2,3,4,10], dtype=float)
>> cluster.vq.kmeans(v, 1, thresh=1e15)
   (array([ 4.]), 2.3999999999999999)
>> cluster.vq.kmeans(v, 1, thresh=1e16)
<snip>
IndexError: list index out of range

The problem is in these lines:

    diff = thresh+1.
    while diff > thresh:
        <snip>
        if(diff > thresh):

If thresh is large enough (above 2**53, about 9e15, for float64) then
thresh + 1 rounds back to thresh, so (thresh + 1) > thresh is False:

>> thresh = 1e16
>> diff = thresh + 1.0
>> diff > thresh
   False
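
The cutoff follows from float64 spacing. Here is a small check using only the
standard library (math.ulp needs Python 3.9 or later, which postdates this
post; the helper name plus_one_survives is made up for illustration):

```python
import math

def plus_one_survives(thresh):
    """Return True if thresh + 1.0 is still strictly greater than thresh."""
    return (thresh + 1.0) > thresh

for thresh in (1e15, 1e16):
    # math.ulp gives the gap between thresh and the next larger float
    print(thresh, math.ulp(thresh), plus_one_survives(thresh))
```

At 1e15 the gap between adjacent floats is 0.125, so adding 1 is visible. At
1e16 the gap is 2, so thresh + 1 lands halfway between two floats and rounds
back to thresh.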

What's a use case for a large threshold? You might want to study the
algorithm by seeing the result after one iteration (not to be confused
with the iter input which is something else).

One fix is to use 2*thresh instead of thresh + 1. But that just
pushes the problem out to higher thresholds, such as inf (which is
what 1e400 overflows to):

>> thresh = 1e16
>> diff = 2 * thresh
>> diff > thresh
   True

>> thresh = 1e400
>> diff = 2 * thresh
>> diff > thresh
   False

A better fix is to replace:

if diff > thresh

with

if (diff > thresh) or (count == 0)

or

if (diff > thresh) or firstflag
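
To make the flag idea concrete, here is a minimal sketch on a toy 1-D,
single-centroid problem. The function toy_kmeans_1d and everything inside it
are hypothetical and written only for illustration; this is not SciPy's
_kmeans, it only mimics the loop shape quoted above:

```python
def toy_kmeans_1d(obs, guess, thresh=1e-5):
    """Toy single-centroid loop (hypothetical, not SciPy's _kmeans)."""
    centroid = float(guess)
    avg_dist = []
    diff = thresh + 1.0   # same initialization as the quoted code
    first = True          # guarantees one pass even if diff > thresh is False
    while first or diff > thresh:
        first = False
        # mean absolute distance of the observations to the current centroid
        avg_dist.append(sum(abs(x - centroid) for x in obs) / len(obs))
        # move the centroid to the mean of its members (here: all of obs)
        centroid = sum(obs) / len(obs)
        if len(avg_dist) > 1:
            diff = avg_dist[-2] - avg_dist[-1]
    return centroid, avg_dist[-1]

obs = [1, 2, 3, 4, 10]
print(toy_kmeans_1d(obs, 1.0, thresh=1e-5))          # runs to convergence
print(toy_kmeans_1d(obs, 1.0, thresh=1e16))          # stops after one pass
print(toy_kmeans_1d(obs, 1.0, thresh=float("inf")))  # also one pass, no error
```

Because the flag forces the first pass, avg_dist is never empty when the
function returns, whatever the value of thresh.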

Ticket: http://projects.scipy.org/scipy/ticket/1247

