[SciPy-user] Mysterious kmeans() error

josef.pktd@gmai... josef.pktd@gmai...
Fri Feb 6 10:25:31 CST 2009


On Fri, Feb 6, 2009 at 11:05 AM, David Cournapeau <cournape@gmail.com> wrote:
> On Fri, Feb 6, 2009 at 11:37 PM, Roy H. Han
> <starsareblueandfaraway@gmail.com> wrote:
>> Well I feel like there are numerical problems with scipy's kmeans2(),
>> at least in the 0.6.0 version of scipy.
>
> kmeans and kmeans2 are fairly low level - they will indeed fail if you
> have an empty cluster.

I thought that the test test_kmeans_lost_cluster(self) verifies that
empty clusters are handled.


>
>> I changed the code to try to ensure that no clusters were empty.
>> Pycluster seems to be the better clustering algorithm for now.
>
> Maybe - I am not familiar with pycluster.
>
>> Even though the size (number of columns = 3) of each vector in the
>> cluster is three, kmeans should still work even if one of the clusters
>> contained a single vector (number of rows = 1).
>
> Strictly speaking, kmeans is undefined in that case - there are
> various strategies which can be implemented, like cluster splitting,
> etc... Generally, I agree the code is not great.
>
> David

If the problem is just the Cholesky decomposition in the random
initialization, then it should be possible to switch to a different
initialization scheme, or to force a valid covariance matrix for the
Cholesky decomposition. E.g., replace cov with a diagonal matrix, or
ensure that cov has the right dimension and add a small diagonal
matrix to it (as in ridge regression or Tikhonov regularization).
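
A minimal sketch of the second option, assuming NumPy only. This is not
scipy's actual initialization code; the helper names regularized_cholesky
and random_init, and the default eps, are made up for illustration. The
idea is to force cov to be d x d even for a single observation, then add
a small multiple of the identity so that the Cholesky factorization
succeeds:

```python
import numpy as np


def regularized_cholesky(data, eps=1e-8):
    """Cholesky factor of a ridge-regularized covariance (hypothetical helper).

    data is an (n, d) array of observations, one per row.
    eps is an assumed relative size for the diagonal term.
    """
    data = np.atleast_2d(np.asarray(data, dtype=float))
    n, d = data.shape
    if n < 2:
        # A single observation has no sample covariance; start from zeros.
        cov = np.zeros((d, d))
    else:
        # atleast_2d guards against the 0-d result np.cov returns when d == 1.
        cov = np.atleast_2d(np.cov(data, rowvar=False))
    # Scale the ridge by the average variance so it is not swamped by rounding.
    scale = np.trace(cov) / d
    ridge = eps * scale if scale > 0 else eps
    return np.linalg.cholesky(cov + ridge * np.eye(d))


def random_init(data, k, rng):
    """Draw k starting centroids from N(mean, cov + ridge * I) (hypothetical)."""
    data = np.atleast_2d(np.asarray(data, dtype=float))
    mean = data.mean(axis=0)
    chol = regularized_cholesky(data)
    # If z ~ N(0, I), then mean + z @ chol.T has covariance chol @ chol.T.
    return mean + rng.standard_normal((k, data.shape[1])) @ chol.T
```

With this, a cluster holding a single 3-column vector still yields a 3 x 3
factor, e.g. random_init(np.array([[1.0, 2.0, 3.0]]), 2,
np.random.default_rng(0)) returns a (2, 3) array instead of raising
LinAlgError. Replacing cov with np.diag(np.diag(cov)) would be the
"diagonal matrix" variant, though that still needs a ridge if any column
has zero variance.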

Josef

