[SciPy-User] kmeans

Lutz Maibaum lutz.maibaum@gmail....
Fri Jul 23 13:23:43 CDT 2010


> Examining further, I see that SciPy's implementation is fairly simplistic
> and has some issues.  In the given example, the reason why 3 is never
> returned is not because of the use of the distortion metric, but rather
> because the kmeans function never sees the distance for using 3.  As a
> matter of fact, the actual code that does the convergence is in vq and py_vq
> (vector quantization) and it tries to minimize the sum of squared errors.
> kmeans just keeps on retrying the convergence with random guesses to see if
> different convergences occur.

At least to me, this is pretty much the definition of the k-means
algorithm. To be more precise, it is the "standard algorithm" for
the k-means optimization problem (minimizing the intra-cluster
variance). The solution it finds is a local minimum that doesn't
necessarily correspond to the global minimum (see, for example,
http://en.wikipedia.org/wiki/K-means_clustering). I agree that it
would be much more natural if the resulting sum of squared distances
were returned, since this is the objective function being minimized.
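
To make the point about the sum of squared distances concrete, here is
a rough NumPy sketch of the standard iteration. It is not SciPy's or
Pycluster's actual code; the function name lloyd_kmeans and its
parameters (n_iter, seed) are made up for illustration.

```python
import numpy as np


def lloyd_kmeans(obs, k, n_iter=100, seed=0):
    """Hypothetical helper: standard k-means iteration, returning the SSE."""
    # Work in floating point so integer input does not truncate the means.
    obs = np.asarray(obs, dtype=float)
    if obs.ndim == 1:
        obs = obs[:, np.newaxis]  # treat 1-D input as n_obs x 1 feature

    rng = np.random.default_rng(seed)
    # Random initial guess: k distinct observations.
    centroids = obs[rng.choice(len(obs), size=k, replace=False)]

    for _ in range(n_iter):
        # Squared Euclidean distance of every observation to every centroid.
        d2 = ((obs[:, np.newaxis, :] - centroids[np.newaxis, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)

        # Move each centroid to the mean of its members (empty clusters stay put).
        new_centroids = centroids.copy()
        for j in range(k):
            members = obs[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)

        converged = np.allclose(new_centroids, centroids)
        centroids = new_centroids
        if converged:
            break

    # The quantity being minimized: sum of squared distances to the nearest centroid.
    d2 = ((obs[:, np.newaxis, :] - centroids[np.newaxis, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    sse = d2.min(axis=1).sum()
    return centroids, labels, sse
```

Running this from several random starting points and keeping the result
with the smallest sse would be the retry strategy described in the
quoted text, only with the objective itself as the criterion.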

> Is Pycluster even maintained anymore?  Maybe we should look into integrating
> it into SciPy if it isn't being maintained.

As far as I can tell, Pycluster does pretty much the same thing.

One improvement I would suggest is for kmeans to do its calculations
in a floating-point data type when given integer values, which would
make it consistent with np.mean(). At the very least, the
documentation should warn that it currently doesn't.
For example, right now I get the following:

In [58]: cluster.vq.kmeans(np.array([1,2]), 1)
Out[58]: (array([1]), 0.5)

In [59]: cluster.vq.kmeans(np.array([1.,2.]), 1)
Out[59]: (array([ 1.5]), 0.5)

In [60]: np.mean(np.array([1,2]))
Out[60]: 1.5
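
In the meantime, a workaround that should avoid the truncation is to
convert the data to floating point before calling kmeans. This is only
a sketch: the variable names are mine, and reshaping to an
n_obs x n_features column is an assumption I make to stay on the
documented input layout.

```python
import numpy as np
from scipy.cluster.vq import kmeans

data = np.array([1, 2])  # integer observations, as in the session above

# Cast to float (and to an n_obs x 1 column) so the centroid is not
# forced into an integer dtype.
obs = data.astype(float).reshape(-1, 1)

centroids, distortion = kmeans(obs, 1)
print(centroids, distortion)
```

With a single cluster the centroid should then agree with
np.mean(data).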

Best,

  Lutz

