Fri Jul 23 15:01:28 CDT 2010
On Fri, Jul 23, 2010 at 2:53 PM, Keith Goodman <firstname.lastname@example.org> wrote:
> On Fri, Jul 23, 2010 at 12:40 PM, Benjamin Root <email@example.com> wrote:
> > On Fri, Jul 23, 2010 at 2:06 PM, Lutz Maibaum <firstname.lastname@example.org>
> > wrote:
> >> On Fri, Jul 23, 2010 at 11:54 AM, Keith Goodman <email@example.com>
> >> wrote:
> >> > On Fri, Jul 23, 2010 at 11:39 AM, Lutz Maibaum <
> >> > wrote:
> >> >> To be compatible with the (at least to me!) standard use of k-means, I
> >> >> think both code and doc should use the sum of squared distances as the
> >> >> cost function in the optimization, and also as the return value.
> >> >
> >> > What about the thresh (threshold) input parameter? If the sum of
> >> > squares were used then the user would have to adjust the threshold for
> >> > the number of data points.
> >> That's true, but personally I don't find that much of a problem. Using
> >> an absolute threshold one needs to have some intuition about the
> >> magnitude of the cost function based on the type and amount of data.
> >> Alternatively, one could use a relative improvement as the convergence
> >> criterion, for example (something like "if
> >> (old_cost-new_cost)/old_cost < threshold then converged"), which may
> >> be suitable for a larger variety of clustering problems.
> >> -- Lutz
> > However, we wouldn't want to change the characteristic behavior of
> > yet.
> That's a good point. Are all these considered "bugs"?
> - Switch code and doc to use rmse
> - Integer bug
> - Select initial centroids without replacement
My vote is yes, although I am not 100% convinced that the integer bug should
be changed because it may cause breakage with those who have been depending
on integer output.
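
For reference, here is a rough sketch of the relative-improvement test Lutz
suggested, next to a sum-of-squared-distances cost. The names sse_cost and
has_converged and the default thresh value are hypothetical, chosen only for
illustration; this is not the scipy.cluster.vq code. It assumes labels holds
the index of the centroid assigned to each observation.

```python
import numpy as np


def sse_cost(data, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    diff = data - centroids[labels]
    return float(np.sum(diff ** 2))


def has_converged(old_cost, new_cost, thresh=1e-5):
    """Relative test: (old_cost - new_cost) / old_cost < thresh."""
    if old_cost == 0.0:
        # Already a perfect fit, so there is nothing left to improve.
        return True
    return (old_cost - new_cost) / old_cost < thresh


# Two points sharing one centroid at their midpoint.
data = np.array([[0.0, 0.0], [2.0, 0.0]])
centroids = np.array([[1.0, 0.0]])
labels = np.array([0, 0])

cost_small = sse_cost(data, centroids, labels)
# Duplicating every observation doubles the sum of squares ...
cost_large = sse_cost(np.tile(data, (2, 1)), centroids, np.tile(labels, 2))
# ... but a relative test does not depend on that overall scale.
same_decision = has_converged(100.0, 99.0, 0.05) == has_converged(200.0, 198.0, 0.05)
```

With an absolute threshold on the sum of squares, the doubling shown above is
what would force the user to rescale thresh with the number of data points;
the relative test sidesteps that at the price of changing what thresh means.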