[SciPy-User] kmeans and initial centroid guesses

David Cournapeau cournape@gmail....
Sun Dec 27 19:47:45 CST 2009

On Mon, Dec 28, 2009 at 10:37 AM, Keith Goodman <kwgoodman@gmail.com> wrote:
> The kmeans function has two modes. In one of the modes the initial
> guesses for the centroids are randomly selected from the input data.
> The selection is currently done with replacement:
> guess = take(obs, randint(0, No, k), 0)
> That means some of the centroids in the initial guess might be the
> same. Wouldn't it be better to select without replacement?

I think you are right, but sampling without replacement only partly
helps with floating point data: if two observations are different but
very close, you would see the same effect, right?
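
To make that concrete, here is a minimal sketch (not the scipy code; the
function names, the toy data and the use of numpy's Generator API are
my own assumptions). The first helper mirrors the quoted take/randint
line, the second draws distinct row indices instead. Even with distinct
indices, two of the selected rows can sit almost on top of each other:

```python
import numpy as np

def sample_with_replacement(obs, k, rng):
    """Mirror of the quoted line: row indices may repeat."""
    idx = rng.integers(0, obs.shape[0], size=k)
    return np.take(obs, idx, axis=0)

def sample_without_replacement(obs, k, rng):
    """Draw k distinct row indices, so no row is picked twice."""
    idx = rng.choice(obs.shape[0], size=k, replace=False)
    return np.take(obs, idx, axis=0)

def min_pairwise_distance(points):
    """Smallest Euclidean distance between two different rows."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # ignore the distance of a row to itself
    return dist.min()

rng = np.random.default_rng(0)
# Toy data (hypothetical): the first two rows are distinct but nearly equal.
obs = np.array([[0.0, 0.0], [1e-12, 0.0], [5.0, 5.0]])

# With k equal to the number of rows, every row is selected exactly once.
guess = sample_without_replacement(obs, 3, rng)
gap = min_pairwise_distance(guess)
```

Here gap is nonzero, so the centroids are technically all different, but
two of them are practically the same starting point.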

Generally, for clustering algorithms, I think you'd want to start
with centroids as far from each other as possible, so maybe the code
could be improved taking this into account.
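
One possible way to do this is a greedy "farthest first" selection:
pick one observation at random, then repeatedly add the observation
whose distance to its nearest already-chosen centroid is largest. The
sketch below is only an illustration under that assumption; the function
name farthest_first_init is hypothetical and this is not what kmeans in
scipy does.

```python
import numpy as np

def farthest_first_init(obs, k, rng):
    """Greedy farthest-point choice of k initial centroids from obs."""
    n = obs.shape[0]
    chosen = [int(rng.integers(n))]  # first centroid: a random observation
    # dist[i] = distance from obs[i] to its nearest chosen centroid so far
    dist = np.linalg.norm(obs - obs[chosen[0]], axis=1)
    for _ in range(1, k):
        nxt = int(np.argmax(dist))   # observation farthest from all chosen
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(obs - obs[nxt], axis=1))
    return obs[chosen]

rng = np.random.default_rng(0)
# Toy data (hypothetical): two tight groups, near x = 0 and near x = 10.
obs = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [10.1, 0.0]])
centroids = farthest_first_init(obs, 2, rng)
```

With this toy data the two centroids always come from different groups,
whichever observation is drawn first. Note that the greedy choice is
sensitive to outliers, and it can still return repeated rows if k is
larger than the number of distinct observations.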


