[SciPy-User] kmeans

Lutz Maibaum lutz.maibaum@gmail....
Fri Jul 23 14:06:13 CDT 2010

On Fri, Jul 23, 2010 at 11:54 AM, Keith Goodman <kwgoodman@gmail.com> wrote:
> On Fri, Jul 23, 2010 at 11:39 AM, Lutz Maibaum <lutz.maibaum@gmail.com> wrote:
>> To be compatible with the (at least to me!) standard use of k-means, I
>> think both code and doc should use the sum of squared distances as the
>> cost function in the optimization, and also as the return value.
> What about the thresh (threshold) input parameter? If the sum of
> squares were used then the user would have to adjust the threshold for
> the number of data points.

That's true, but personally I don't find that much of a problem. With
an absolute threshold, one needs some intuition about the magnitude of
the cost function, which depends on the type and amount of data.
Alternatively, one could use the relative improvement as the
convergence criterion (something like "if
(old_cost - new_cost)/old_cost < threshold then converged"), which may
be suitable for a larger variety of clustering problems.
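
To make the relative criterion concrete, here is a minimal sketch. It is
not the scipy.cluster.vq code; the names sum_sq_cost, kmeans_relative
and rel_thresh are made up for this example, and the cost is assumed to
be the sum of squared distances discussed above. The test is written as
a product rather than a quotient so that a cost of exactly zero does not
cause a division by zero.

```python
import numpy as np


def sum_sq_cost(data, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    diffs = data - centroids[labels]
    return float(np.sum(diffs ** 2))


def kmeans_relative(data, init_centroids, rel_thresh=1e-5, max_iter=100):
    """Plain k-means iterations that stop on relative cost improvement.

    Illustrative only: rel_thresh is a hypothetical parameter, not the
    thresh argument of scipy.cluster.vq.kmeans.
    """
    centroids = np.array(init_centroids, dtype=float)
    old_cost = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: move each centroid to the mean of its points.
        for k in range(len(centroids)):
            members = data[labels == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
        new_cost = sum_sq_cost(data, centroids, labels)
        # Relative criterion: (old_cost - new_cost) / old_cost <= rel_thresh,
        # rearranged to avoid dividing by a zero cost.
        if old_cost is not None and old_cost - new_cost <= rel_thresh * old_cost:
            break
        old_cost = new_cost
    return centroids, labels, new_cost


# Same data, once as is and once repeated ten times.
small = np.array([[0.0], [1.0], [10.0], [11.0]])
large = np.tile(small, (10, 1))
init = [[0.0], [1.0]]

cent_small, _, cost_small = kmeans_relative(small, init)
cent_large, _, cost_large = kmeans_relative(large, init)
```

In this toy case the summed cost for the repeated data is ten times
larger, so an absolute threshold would have to be rescaled, while the
same rel_thresh works for both data sets and yields the same centroids.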

  -- Lutz
