[SciPy-User] kmeans
Benjamin Root
ben.root@ou....
Sun Jul 25 17:11:12 CDT 2010
On Sun, Jul 25, 2010 at 4:53 PM, Keith Goodman <kwgoodman@gmail.com> wrote:
> On Sun, Jul 25, 2010 at 12:41 PM, David Cournapeau <cournape@gmail.com>
> wrote:
> > On Sun, Jul 25, 2010 at 2:36 AM, Keith Goodman <kwgoodman@gmail.com>
> wrote:
> >> _kmeans chokes on large thresholds:
> >>
> >>>> import numpy as np
> >>>> from scipy import cluster
> >>>> v = np.array([1,2,3,4,10], dtype=float)
> >>>> cluster.vq.kmeans(v, 1, thresh=1e15)
> >> (array([ 4.]), 2.3999999999999999)
> >>>> cluster.vq.kmeans(v, 1, thresh=1e16)
> >> <snip>
> >> IndexError: list index out of range
> >>
> >> The problem is in these lines:
> >>
> >> diff = thresh+1.
> >> while diff > thresh:
> >> <snip>
> >> if(diff > thresh):
> >>
> >> If thresh is large then (thresh + 1) > thresh is False:
> >>
> >>>> thresh = 1e16
> >>>> diff = thresh + 1.0
> >>>> diff > thresh
> >> False
> >>
> >> What's a use case for a large threshold? You might want to study the
> >> algorithm by seeing the result after one iteration (not to be confused
> >> with the iter input which is something else).
> >>
> >> One fix is to use 2*thresh instead of thresh + 1. But that just
> >> pushes the problem out to higher thresholds.
> >
> > Or just use the spacing function, which by definition returns the
> > smallest number M such that thresh + M > thresh (except for nan/inf).
>
> Neat, I've never heard of np.spacing. But it suffers the same fate:
>
> Works:
>
> >> thresh = 1e16
> >> diff = thresh + np.spacing(thresh)
> >> diff > thresh
> True
>
> Doesn't work:
>
> >> thresh = 1e400
> >> diff = thresh + np.spacing(thresh)
> >> diff > thresh
> False
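
The 1e400 case looks like the nan/inf exception David mentioned: the literal does not fit in a double, so Python stores it as inf, and the spacing of inf is nan. A small check, assuming only NumPy (the variable names are mine, not from _kmeans):

```python
import numpy as np

thresh = 1e400                       # too large for a double: the literal becomes inf
with np.errstate(invalid="ignore"):  # silence a possible invalid-value warning
    step = np.spacing(thresh)        # the spacing of inf is nan
    diff = thresh + step             # inf + nan is nan

# Any ordered comparison involving nan is False, so diff > thresh fails.
print(np.isinf(thresh), np.isnan(step), diff > thresh)
```

So np.spacing should cover every finite threshold; only an infinite (or nan) thresh still needs separate handling.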
>
> len(avg_dist) == 0 could be used to mark the first time through the loop.
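
One possible reading of that suggestion, as a rough sketch rather than the actual _kmeans code: drop the thresh + 1 sentinel and let the empty list mark the first pass. The function name and the `distortions` iterable are hypothetical stand-ins for one k-means update per item, and taking the change as previous minus current is my assumption.

```python
def run_until_converged(distortions, thresh):
    """Collect distortions until the change drops to thresh or below.

    `distortions` is a hypothetical iterable, one value per update step.
    """
    avg_dist = []
    for dist in distortions:
        first_pass = len(avg_dist) == 0   # nothing to compare against yet
        avg_dist.append(dist)
        if first_pass:
            continue                      # always take at least one more step
        diff = avg_dist[-2] - avg_dist[-1]
        if not diff > thresh:             # same test as `while diff > thresh`
            break
    return avg_dist


# No sentinel arithmetic is involved, so huge or infinite thresholds are fine.
print(run_until_converged([2.4, 2.0, 1.9, 1.9], thresh=1e16))
print(run_until_converged([2.4, 2.0, 1.9, 1.9], thresh=float("inf")))
```

Because the first pass is detected from the list itself, nothing depends on thresh + something being larger than thresh.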
>
> Another minor issue:
>
> The kmeans docstring says iteration stops when the change in
> distortion is less than threshold. But as coded (if diff > thresh)
> iteration also stops when the change is equal to the threshold.
>
> Could either fix the code or the docstring. Fixing the code (if diff
> >= thresh) means that thresh=0 could enter an infinite loop (negative
> thresh already enters an infinite loop). So fixing the docstring seems
> better.
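
To make the infinite-loop point concrete, here is a toy loop (not the scipy code) that counts passes under either comparison. The helper name, the `max_iter` safety cap, and the assumption that the distortion stays at its last value once converged are all mine.

```python
def count_iterations(distortions, thresh, strict, max_iter=100):
    """Count loop passes; hitting max_iter stands in for 'never stops'."""
    prev = None
    n = 0
    while n < max_iter:
        # Once the list is exhausted, assume the distortion stays constant.
        cur = distortions[min(n, len(distortions) - 1)]
        n += 1
        if prev is not None:
            diff = prev - cur
            keep_going = diff > thresh if strict else diff >= thresh
            if not keep_going:
                break
        prev = cur
    return n


dists = [2.4, 2.0, 2.0]
print(count_iterations(dists, thresh=0.0, strict=True))    # `>`  stops once diff is 0
print(count_iterations(dists, thresh=0.0, strict=False))   # `>=` runs into the cap
print(count_iterations(dists, thresh=-1.0, strict=True))   # negative thresh: cap again
```

With `>=`, a converged run has diff == 0 forever, so thresh=0 can never end the loop, which is the argument for changing the docstring instead.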
>
>
I have updated the docstring via the wiki. There are probably a few more
changes that need to be made before marking it as ready for release.
http://docs.scipy.org/scipy/docs/scipy.cluster.vq.kmeans/
Ben Root
