Fri Jul 23 13:23:43 CDT 2010
> Examining further, I see that SciPy's implementation is fairly simplistic
> and has some issues. In the given example, the reason why 3 is never
> returned is not because of the use of the distortion metric, but rather
> because the kmeans function never sees the distance for using 3. As a
> matter of fact, the actual code that does the convergence is in vq and py_vq
> (vector quantization) and it tries to minimize the sum of squared errors.
> kmeans just keeps on retrying the convergence with random guesses to see if
> different convergences occur.
At least to me, this is pretty much the definition of the k-means
algorithm. To be more precise, it is the "standard algorithm" that
finds a solution to the k-means optimization problem (to minimize the
intra-cluster variance), which doesn't necessarily correspond to the
global minimum (see, for example,
http://en.wikipedia.org/wiki/K-means_clustering). I agree that it
would be much more natural if the resulting sum of squared distances
were returned, since that is the objective function being minimized.
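For reference, the "standard algorithm" amounts to alternating an assignment step and a centroid-update step. Here is a minimal sketch of that iteration in NumPy that returns the sum of squared distances discussed above; the function name and the fixed iteration count are my own choices, not anything from SciPy:

```python
import numpy as np

def lloyd_kmeans(data, k, n_iter=20, seed=0):
    """Minimal sketch of the standard (Lloyd's) k-means iteration.

    Returns the centroids and the sum of squared distances of each
    point to its nearest centroid (the quantity being minimized).
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)  # force float to avoid integer truncation
    # start from k distinct observations chosen at random
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: label each point with its nearest centroid
        dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each centroid to the mean of its points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = data[labels == j].mean(axis=0)
    # recompute distances against the final centroids for the objective
    dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    sse = dists.min(axis=1).sum()
    return centroids, sse

centroids, sse = lloyd_kmeans(np.array([[1.0], [2.0], [10.0], [11.0]]), 2)
```

Because the initial centroids are random, repeated runs (as SciPy's kmeans does) can land in different local minima, which is exactly why returning the objective value is useful for comparing them.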
> Is Pycluster even maintained anymore? Maybe we should look into integrating
> it into SciPy if it isn't being maintained.
As far as I can tell, Pycluster does pretty much the same thing.
One improvement I would suggest is to have the kmeans algorithm
perform its calculations in a floating-point data type even when given
integer values, which would make it consistent with np.mean(). Failing
that, the documentation should at least warn that it doesn't.
For example, right now I get the following:
In : cluster.vq.kmeans(np.array([1,2]), 1)
Out: (array([1]), 0.5)
In : cluster.vq.kmeans(np.array([1.,2.]), 1)
Out: (array([ 1.5]), 0.5)
In : np.mean(np.array([1,2]))
Out: 1.5
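Until the behavior changes, a simple workaround is to cast the observations to float before calling kmeans; this is just a sketch of the cast, assuming the same toy data as the session above:

```python
import numpy as np
from scipy import cluster

data = np.array([1, 2])  # integer input truncates the centroid, as shown above
# casting to float first makes kmeans agree with np.mean()
codebook, distortion = cluster.vq.kmeans(data.astype(float), 1)
# codebook is array([1.5]) and distortion is 0.5
```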