[SciPy-User] kmeans and initial centroid guesses

Keith Goodman kwgoodman@gmail....
Tue Dec 29 12:09:53 CST 2009


On Sun, Dec 27, 2009 at 6:07 PM, Keith Goodman <kwgoodman@gmail.com> wrote:
> On Sun, Dec 27, 2009 at 5:47 PM, David Cournapeau <cournape@gmail.com> wrote:
>> On Mon, Dec 28, 2009 at 10:37 AM, Keith Goodman <kwgoodman@gmail.com> wrote:
>>> The kmeans function has two modes. In one of the modes the initial
>>> guesses for the centroids are randomly selected from the input data.
>>> The selection is currently done with replacement:
>>>
>>> guess = take(obs, randint(0, No, k), 0)
>>>
>>> That means some of the centroids in the intial guess might be the
>>> same. Wouldn't it be better to select without replacement?
>>
>> I think you are right, but random sampling without replacement for
>> floating point values is a bit hard to use here: if two values are
>> different but very close, you would see the same effect, right ?
>>
>> Generally, for clustering algorithms, I think you'd you want to start
>> with centroids as far from each other as possible, so maybe the code
>> could be improved taking this into account.
>
> That's a good point. And it sounds like an interesting problem to find
> the set of k points that are farthest apart. But the time it takes to
> find far apart points could be used to do more iterations of randomly
> selected points. I see that kmeans2 has a few different methods for
> selecting the initial points. A method like the one you suggest could
> be added there. (Maybe use pdist to select the two farthest apart
> points and then select the point that is furthest away from the first
> two points and so on. It would get slow if a lot of points were
> needed.) For kmeans I'll file a ticket to draw without replacement.

I opened a ticket and created a patch:

http://projects.scipy.org/scipy/ticket/1078


More information about the SciPy-User mailing list