# [SciPy-User] kmeans

Benjamin Root ben.root@ou....
Sun Jul 25 17:11:12 CDT 2010

```
On Sun, Jul 25, 2010 at 4:53 PM, Keith Goodman <kwgoodman@gmail.com> wrote:

> On Sun, Jul 25, 2010 at 12:41 PM, David Cournapeau <cournape@gmail.com>
> wrote:
> > On Sun, Jul 25, 2010 at 2:36 AM, Keith Goodman <kwgoodman@gmail.com>
> wrote:
> >> _kmeans chokes on large thresholds:
> >>
> >>>> from scipy import cluster
> >>>> v = np.array([1,2,3,4,10], dtype=float)
> >>>> cluster.vq.kmeans(v, 1, thresh=1e15)
> >>   (array([ 4.]), 2.3999999999999999)
> >>>> cluster.vq.kmeans(v, 1, thresh=1e16)
> >> <snip>
> >> IndexError: list index out of range
> >>
> >> The problem is in these lines:
> >>
> >>    diff = thresh+1.
> >>    while diff > thresh:
> >>        <snip>
> >>        if(diff > thresh):
> >>
> >> If thresh is large then (thresh + 1) > thresh is False:
> >>
> >>>> thresh = 1e16
> >>>> diff = thresh + 1.0
> >>>> diff > thresh
> >>   False
> >>
> >> What's a use case for a large threshold? You might want to study the
> >> algorithm by seeing the result after one iteration (not to be confused
> >> with the iter input which is something else).
> >>
> >> One fix is to use 2*thresh instead of thresh + 1. But that just
> >> pushes the problem out to higher thresholds.
> >
> > Or just use the spacing function, which by definition returns the
> > smallest number M such that thresh + M > thresh (except for nan/inf)
>
> Neat, I've never heard of np.spacing. But it suffers the same fate:
>
> Works:
>
> >> thresh = 1e16
> >> diff = thresh + np.spacing(thresh)
> >> diff > thresh
>   True
>
> Doesn't work:
>
> >> thresh = 1e400
> >> diff = thresh + np.spacing(thresh)
> >> diff > thresh
>   False
>
> len(avg_dist) == 0 could be used to mark the first time through the loop.
>
> Another minor issue:
>
> The kmeans docstring says iteration stops when the change in
> distortion is less than threshold. But as coded (if diff > thresh)
> iteration also stops when the change is equal to the threshold.
>
> Could either fix the code or the docstring. Fixing the code (if diff
> >= thresh) means that thresh=0 could enter an infinite loop (negative
> thresh already enters an infinite loop). So fixing the docstring seems
> better.
>
>
I have updated the docstring via the wiki.  There are probably a few more
changes that need to be made before marking it as ready for release.

http://docs.scipy.org/scipy/docs/scipy.cluster.vq.kmeans/

Ben Root
```
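
The loop-guard problem discussed in the thread can be sketched without SciPy. The function below is a hypothetical stand-in, not the actual `_kmeans` code: `run_until_converged` and its `distortions` argument are names invented for illustration, and the per-iteration distortions are passed in rather than computed from data. It follows the suggestion of using the length of `avg_dist` to force the first passes, with the added assumption that two distortions must be recorded before any comparison against `thresh` is made.

```python
def run_until_converged(distortions, thresh):
    """Record distortions until the improvement is no longer > thresh.

    Hypothetical sketch: `distortions` stands in for the values a real
    k-means update step would produce on each pass.
    """
    avg_dist = []
    diff = 0.0  # placeholder; not compared until two values are recorded
    values = iter(distortions)
    # The guard depends on len(avg_dist), not on thresh + 1, so it does
    # not break when thresh is so large that thresh + 1 == thresh.
    while len(avg_dist) < 2 or diff > thresh:
        try:
            avg_dist.append(next(values))
        except StopIteration:
            break  # ran out of precomputed distortions
        if len(avg_dist) > 1:
            diff = avg_dist[-2] - avg_dist[-1]
    return avg_dist


# The failure mode from the thread: adding 1 to 1e16 is absorbed by rounding.
assert not (1e16 + 1.0 > 1e16)

history = run_until_converged([10.0, 6.0, 5.0, 4.9, 4.89], thresh=0.5)
huge = run_until_converged([10.0, 6.0, 5.0], thresh=float("inf"))
```

With `thresh=0.5` the sketch stops once the improvement drops to about 0.1, and with an infinite threshold it still records two distortions instead of skipping the loop and raising an `IndexError`. Because the comparison remains `diff > thresh`, iteration also stops when the change equals the threshold, as noted in the thread.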