# [SciPy-User] kmeans and initial centroid guesses

Keith Goodman kwgoodman@gmail....
Sun Dec 27 19:37:21 CST 2009

When k_or_guess is an integer k, the initial guesses for the centroids
are randomly selected from the input data. The selection is currently
done with replacement:

guess = take(obs, randint(0, No, k), 0)

That means some of the centroids in the initial guess might be the
same. Wouldn't it be better to select without replacement? Something
like

guess = take(obs, rand(No).argsort()[:k], 0)
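
The two selection schemes above can be compared side by side. This is a minimal sketch, not the scipy source: it assumes NumPy's `default_rng` generator in place of the bare `randint` and `rand` calls, a fixed seed for reproducibility, and the four-point `obs` array from the example in this post. The `rng.choice(..., replace=False)` line is an alternative I am assuming would serve the same purpose as the argsort trick.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility (assumption)

obs = np.array([[1, 1], [-1, -1], [-1, 1], [1, -1]])
No, k = obs.shape[0], 4

# With replacement: row indices are drawn independently, so they may repeat.
idx_with = rng.integers(0, No, size=k)
guess_with = np.take(obs, idx_with, axis=0)

# Without replacement: sort random keys and keep the first k positions.
# Every index appears exactly once in the argsort, so the k picks are distinct.
idx_without = rng.random(No).argsort()[:k]
guess_without = np.take(obs, idx_without, axis=0)

# Alternative without replacement, using Generator.choice.
idx_choice = rng.choice(No, size=k, replace=False)
guess_choice = np.take(obs, idx_choice, axis=0)
```

Selecting distinct indices only guarantees distinct centroids if the rows of obs are themselves distinct; duplicated observations can still produce duplicated guesses.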

Here's an extreme example of what can go wrong if the selection is
done with replacement:

>> obs

array([[ 1,  1],
       [-1, -1],
       [-1,  1],
       [ 1, -1]])
>> vq.kmeans(obs, k_or_guess=4)

(array([[-1, -1],
        [-1,  1],
        [ 1, -1],
        [ 1,  1]]), 0.0) # <--- good
>>
>> k_or_guess = obs[[1,1,1,1],:]
>> k_or_guess

array([[-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1]])
>> vq.kmeans(obs, k_or_guess)
(array([[0, 0]]), 1.4142135623730951) # <--- not as good

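In most cases, where there are many more observations than centroids,
duplicates in the initial guess are unlikely and it won't make any
difference. But the cost of the code change is small.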
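
To put a rough number on "most cases", here is a small sketch of the chance that a with-replacement draw of k indices from No observations contains at least one repeat. The helper name `prob_duplicate` is hypothetical, and the calculation assumes each index is drawn uniformly and independently, as `randint(0, No, k)` does.

```python
def prob_duplicate(No, k):
    """Probability that k uniform draws with replacement from No items repeat."""
    p_distinct = 1.0
    for i in range(k):
        # The (i+1)-th draw must avoid the i indices already chosen.
        p_distinct *= (No - i) / No
    return 1.0 - p_distinct


print(prob_duplicate(4, 4))     # the 4-point example above: 0.90625
print(prob_duplicate(1000, 4))  # many observations, few centroids: roughly 0.006
```

So for the four-point example a repeated initial centroid is the usual outcome, while for a thousand observations and k = 4 it is rare.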
