[SciPy-User] kmeans

Keith Goodman kwgoodman@gmail....
Fri Jul 23 14:53:13 CDT 2010


On Fri, Jul 23, 2010 at 12:40 PM, Benjamin Root <ben.root@ou.edu> wrote:
> On Fri, Jul 23, 2010 at 2:06 PM, Lutz Maibaum <lutz.maibaum@gmail.com>
> wrote:
>>
>> On Fri, Jul 23, 2010 at 11:54 AM, Keith Goodman <kwgoodman@gmail.com>
>> wrote:
>> > On Fri, Jul 23, 2010 at 11:39 AM, Lutz Maibaum <lutz.maibaum@gmail.com>
>> > wrote:
>> >> To be compatible with the (at least to me!) standard use of k-means, I
>> >> think both code and doc should use the sum of squared distances as the
>> >> cost function in the optimization, and also as the return value.
>> >
>> > What about the thresh (threshold) input parameter? If the sum of
>> > squares were used then the user would have to adjust the threshold for
>> > the number of data points.
>>
>> That's true, but personally I don't find that much of a problem. Using
>> an absolute threshold one needs to have some intuition about the
>> magnitude of the cost function based on the type and amount of data.
>> Alternatively, one could use a relative improvement as the convergence
>> criterion (something like "if (old_cost - new_cost) / old_cost <
>> threshold, then converged"), which may be suitable for a larger variety
>> of clustering problems.
>>
>>  -- Lutz
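
A minimal sketch of the relative criterion described above, in Python.
The name has_converged and its arguments are made up for illustration and
are not part of scipy.cluster.vq; the guard for a zero cost is an added
assumption.

```python
def has_converged(old_cost, new_cost, threshold):
    """Return True when the relative improvement drops below threshold.

    Hypothetical helper for illustration only.
    """
    if old_cost == 0:
        # Assumption: a cost of zero cannot be improved, so stop.
        return True
    relative_improvement = (old_cost - new_cost) / old_cost
    return relative_improvement < threshold
```

Because the improvement is divided by old_cost, the same threshold works
whether the cost is a sum of squares or a mean, and it does not need to be
adjusted for the number of data points.
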
>
> However, we wouldn't want to change the characteristic behavior of
> kmeans... yet.

That's a good point. Are all these considered "bugs"?

- Switch code and doc to use rmse
- Integer bug
- Select initial centroids without replacement
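
For the last item, one possible way to draw k distinct observations as
initial centroids is sketched below. pick_initial_centroids is a
hypothetical helper, not the code SciPy uses; it only assumes NumPy's
Generator.choice with replace=False.

```python
import numpy as np


def pick_initial_centroids(data, k, seed=None):
    """Pick k different rows of data as starting centroids.

    Hypothetical helper for illustration only.
    """
    data = np.asarray(data)
    n = len(data)
    if k > n:
        raise ValueError("k cannot exceed the number of observations")
    rng = np.random.default_rng(seed)
    # replace=False guarantees that no row index is drawn twice.
    idx = rng.choice(n, size=k, replace=False)
    return data[idx]
```

This guarantees distinct row indices. If the data itself contains
duplicate observations, two centroids can still coincide.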

> Personally, I never liked using tolerances and thresholds for stopping
> conditions, which is why I like the C Clustering library's approach of
> iterating until there are no more reassignments (or max iterations).
> Although, I can't remember how it handles the edge case of assignments
> getting passed back and forth between members.
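
A rough sketch of that stopping rule (stop when no observation changes
cluster, or after max_iter passes). This is not the C Clustering
Library's code; kmeans_until_stable, its arguments, and the choice to
leave an empty cluster's centroid unchanged are all assumptions made for
illustration.

```python
import numpy as np


def kmeans_until_stable(data, centroids, max_iter=100):
    """Iterate k-means until the assignments stop changing.

    Hypothetical helper for illustration only.
    """
    data = np.asarray(data, dtype=float)
    centroids = np.array(centroids, dtype=float)  # work on a float copy
    labels = np.full(len(data), -1)
    for _ in range(max_iter):
        # Squared distance from every observation to every centroid.
        diff = data[:, None, :] - centroids[None, :, :]
        new_labels = (diff ** 2).sum(axis=2).argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # no reassignments, so stop
        labels = new_labels
        for j in range(len(centroids)):
            members = data[labels == j]
            if len(members):
                # Assumption: an empty cluster keeps its old centroid.
                centroids[j] = members.mean(axis=0)
    return centroids, labels
```

The max_iter cap is what ends the loop if assignments keep getting passed
back and forth.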
>
> Just to be clear, the C Clustering library's implementation of kmeans is
> entirely different from SciPy's implementation.  While I am certainly no
> expert in determining which approach is better than another, I can say
> that I have used it before and it has worked very nicely for me and my
> uses.
>
> Ben Root
>
> P.S. - As a complete side-note, while I am in this nostalgic fervor, a
> particularly clever use of kmeans/kmedians that I came up with was to
> 'snap' similar grids to a common grid without requiring one to predefine
> that grid.

