[SciPy-User] kmeans

Benjamin Root ben.root@ou....
Fri Jul 23 16:55:20 CDT 2010

On Fri, Jul 23, 2010 at 4:18 PM, Lutz Maibaum <lutz.maibaum@gmail.com>wrote:

> On Jul 23, 2010, at 1:12 PM, Keith Goodman wrote:
> > On Fri, Jul 23, 2010 at 1:01 PM, Benjamin Root <ben.root@ou.edu> wrote:
> >> On Fri, Jul 23, 2010 at 2:53 PM, Keith Goodman <kwgoodman@gmail.com>
> wrote:
> >>> That's a good point. Are all these considered "bugs"?
> >>>
> >>> - Switch code and doc to use rmse
> >>> - Integer bug
> >>> - Select initial centroids without replacement
> >>
> >> My vote is yes, although I am not 100% convinced that the integer bug
> should
> >> be changed because it may cause breakage with those who have been
> depending
> >> on integer output.
> >
> > Maybe just make a ticket for now for the integer problem? Lutz, do you
> > want to make the ticket?
> I have opened a ticket (#1246). An easy and safe fix would be to simply add
> a statement to the docstring that warns the user about clustering integer
> data.
> > It would be nice to find a simple problem that gives the wrong
> > centroids due to the sum of dist bug. We could use that for a unit
> > test of the fix. The example given earlier in the thread returns the
> > right centroid. I guess we need a ticket for this one too.
> Actually, it not entirely clear to me anymore what the bug is. According to
> the k-means Wikipedia page, the objective function that the algorithm tries
> to minimize is the total intra-cluster variance (the sum of squares of
> distances of data points from cluster centroids). However, the two steps of
> the iteration (assignment to centroids, and centroid update) use regular
> distances and means. Is this not what the current code is doing?
Which is why I have been saying that there is no bug here because the code
is technically correct.  A mean of regular distances is a sum of squared
distances that has been divided.  The only reason why the current code is
not returning the correct answer for the given example is that it never
tries 3 as a centroid value.  This is a different issue.


One issue I see is that the documentation mentions that k-means tries to
> minimize distorition, defined as the sum of distances, which (at least
> according to the Wikipedia page) is not correct, because it tries to
> minimize the sum of squared distances.
Exactly... the documentation is wrong.  Kmeans works to minimize the sum of
squared distances.  How you define distances is up to you, so long as it
satisfies the triangle inequality.  And, as far as I can see, the code is
doing exactly this using euclidean distance.

Ben Root
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.scipy.org/pipermail/scipy-user/attachments/20100723/3c91385e/attachment.html 

More information about the SciPy-User mailing list