[SciPy-User] kmeans

Keith Goodman kwgoodman@gmail....
Fri Jul 23 13:33:06 CDT 2010


On Fri, Jul 23, 2010 at 11:23 AM, Lutz Maibaum <lutz.maibaum@gmail.com> wrote:
>> Examining further, I see that SciPy's implementation is fairly simplistic
>> and has some issues.  In the given example, the reason why 3 is never
>> returned is not because of the use of the distortion metric, but rather
>> because the kmeans function never sees the distance for using 3.  As a
>> matter of fact, the actual code that does the convergence is in vq and py_vq
>> (vector quantization) and it tries to minimize the sum of squared errors.
>> kmeans just keeps on retrying the convergence with random guesses to see if
>> different convergences occur.
>
> At least to me, this is pretty much the definition of the k-means
> algorithm. To be more precise, it is the "standard algorithm" that
> finds a solution to the k-means optimization problem (to minimize the
> intra-cluster variance) which doesn't necessarily correspond to the
> global minimum (see, for example,
> http://en.wikipedia.org/wiki/K-means_clustering). I agree that it
> would be much more natural if the resulting sum of squared distances
> were returned, since this is the optimization function.
>
>> Is Pycluster even maintained anymore?  Maybe we should look into integrating
>> it into SciPy if it isn't being maintained.
>
> As far as I can tell, Pycluster does pretty much the same thing.
>
> One improvement that I would suggest is that the kmeans algorithm
> performs its calculation in a floating point data type if given
> integer values, which would make it more compatible with np.mean(). At
> least there should be a warning in the documentation that it doesn't.
> For example, right now I get the following:
>
> In [58]: cluster.vq.kmeans(np.array([1,2]), 1)
> Out[58]: (array([1]), 0.5)
>
> In [59]: cluster.vq.kmeans(np.array([1.,2.]), 1)
> Out[59]: (array([ 1.5]), 0.5)
>
> In [60]: np.mean(np.array([1,2]))
> Out[60]: 1.5

Looks like a bug to me.
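
Here is a minimal NumPy-only sketch of how this kind of truncation could arise, and of a workaround. It assumes the centroid update stores the mean back in the integer dtype of the input; that is a guess at the mechanism, not something checked against the kmeans source. The names obs, truncated, obs_float and as_float are just for illustration.

```python
import numpy as np

obs = np.array([1, 2])  # integer observations, as in In [58] above

# Assumed mechanism: the mean is computed in floating point and then
# cast back to the integer dtype of the input, which drops the fraction.
truncated = obs.mean(axis=0).astype(obs.dtype)

# Workaround on the caller's side: convert to float before clustering.
obs_float = obs.astype(float)
as_float = obs_float.mean(axis=0)

print(truncated, as_float)
```

Until the function is fixed, passing obs.astype(float) instead of the integer array should give the same centroid that np.mean() reports.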

I think it makes sense to fix what is already there and then take the
time to look for new implementations. Big projects like new
implementations tend not to get done.

What needs to be fixed?

- Switch the code and docs to use RMSE
- Fix the integer bug (integer input gives truncated centroids)
- Select initial centroids without replacement
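
To make the first and third items concrete, here is a rough sketch in plain NumPy. It is not the existing scipy.cluster.vq code; init_centroids and rmse are hypothetical helper names, and both assume the observations form a 2-D array with one row per observation. Converting to float up front also sidesteps the integer issue.

```python
import numpy as np


def init_centroids(obs, k, rng=None):
    """Pick k distinct rows of obs as initial centroids (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    obs = np.asarray(obs, dtype=float)
    n = obs.shape[0]
    if k > n:
        raise ValueError("k cannot exceed the number of observations")
    # replace=False guarantees k different row indices. Duplicate rows in
    # obs can still produce identical centroids.
    idx = rng.choice(n, size=k, replace=False)
    return obs[idx]


def rmse(obs, centroids):
    """Root-mean-square distance from each observation to its nearest centroid."""
    obs = np.asarray(obs, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # Squared distance from every observation to every centroid: shape (n, k).
    diff = obs[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    sq_dist = (diff ** 2).sum(axis=2)
    # Keep the nearest centroid per observation, average, then take the root.
    return np.sqrt(sq_dist.min(axis=1).mean())


obs = np.array([[1], [2], [10], [11], [20]])
start = init_centroids(obs, 3, rng=np.random.default_rng(0))
print(start.ravel(), rmse(obs, start))
```

Sampling row indices rather than values is what makes this "without replacement": the same observation cannot be chosen twice as a starting centroid.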

