[SciPy-user] Kmeans help and C source

eric eric at scipy.org
Fri Jan 11 16:17:02 CST 2002


> At some point, I will probably need this in C...

That transition will not be difficult.

> I know I am looking too far
> ahead, but probably my vector-space will not fit into RAM.  Does anybody
> know any tricks for clustering "pages" of the v-space or otherwise fit an
> 1e8 observations cube into a kmeans algorithm?

Is your data fairly uniformly distributed, i.e. if you take the first
1e6 values, do they represent the entire data set fairly well?  If so,
you can run kmeans on this subset of data to find your code_book, and
then read in the rest of the data in chunks to do your classification.
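
A rough sketch of that two-step approach, using kmeans and vq from
scipy.cluster.vq.  The helper names, the random stand-in data, and the
chunk count are made up for illustration; in practice the chunks would
come from whatever reads your file.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

def fit_code_book(sample, k, rng):
    # Seed kmeans with k distinct rows of the sample (hypothetical helper).
    init = sample[rng.choice(len(sample), size=k, replace=False)]
    code_book, _distortion = kmeans(sample, init)
    return code_book

def classify_in_chunks(chunks, code_book):
    # chunks: any iterable of 2-D arrays (rows = observations),
    # e.g. pieces read from disk one at a time.
    for chunk in chunks:
        codes, _dists = vq(chunk, code_book)
        yield codes

rng = np.random.default_rng(0)
# Stand-in data (an assumption): two well-separated blobs, shuffled.
data = np.concatenate([rng.normal(0.0, 0.1, (500, 3)),
                       rng.normal(5.0, 0.1, (500, 3))])
rng.shuffle(data)

# Train on the first 200 rows only, then label everything chunk by chunk.
code_book = fit_code_book(data[:200], 2, rng)
labels = np.concatenate(
    list(classify_in_chunks(np.array_split(data, 10), code_book)))
```

Only the training subset and one chunk need to be in memory at a time.
Note that kmeans can return fewer than k centroids if a cluster ends up
empty.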

If you need to run kmeans on the entire 1e8 space, then you'll have to
do this "chunking" within the kmeans algorithm, and it will become more
complicated.  You're also talking about long run times.  vq is pretty
time intensive, and kmeans calls vq a ton of times.
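
One way the chunking inside kmeans might look: each iteration makes a
full pass over the chunks, calls vq on each one, and accumulates
per-centroid sums and counts so the means can be updated at the end of
the pass.  This is only a sketch, not the scipy kmeans implementation;
chunked_kmeans, make_chunks, and the fixed iteration count are all
assumptions.

```python
import numpy as np
from scipy.cluster.vq import vq

def chunked_kmeans(make_chunks, code_book, n_iter=10):
    # make_chunks() must return a fresh iterable of 2-D chunks each call,
    # because every iteration needs one complete pass over the data.
    code_book = np.array(code_book, dtype=float)
    k = len(code_book)
    for _ in range(n_iter):
        sums = np.zeros_like(code_book)
        counts = np.zeros(k)
        for chunk in make_chunks():
            codes, _dists = vq(chunk, code_book)
            np.add.at(sums, codes, chunk)          # per-centroid running sums
            counts += np.bincount(codes, minlength=k)
        filled = counts > 0                        # leave empty clusters alone
        code_book[filled] = sums[filled] / counts[filled, None]
    return code_book

rng = np.random.default_rng(0)
# Stand-in data (an assumption): two blobs centred near 0 and 5.
data = np.concatenate([rng.normal(0.0, 0.1, (500, 3)),
                       rng.normal(5.0, 0.1, (500, 3))])
result = chunked_kmeans(lambda: np.array_split(data, 10),
                        [[1.0, 1.0, 1.0], [4.0, 4.0, 4.0]])
```

Every iteration re-reads the whole data set, which is where the long
run times come from.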

One other thing.  The new NumArray package that is the next generation
of NumPy has memory-mapped arrays built into it.  A very early .0x
release is available at the NumPy site.  It is probably not usable in
production yet, but it sounds like it might match your needs.
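
For a feel of what memory-mapped access looks like, here is a small
sketch using numpy.memmap, the comparable feature in current NumPy (not
the NumArray API itself).  The file name, array shape, and chunk size
are made up for illustration.

```python
import os
import tempfile
import numpy as np

# Hypothetical file and sizes, chosen only for the example.
path = os.path.join(tempfile.mkdtemp(), "observations.dat")
n_obs, n_features = 10_000, 4

# Create the file on disk and fill it through a writable map.
writer = np.memmap(path, dtype=np.float64, mode="w+",
                   shape=(n_obs, n_features))
writer[:] = np.arange(n_obs * n_features,
                      dtype=np.float64).reshape(n_obs, n_features)
writer.flush()
del writer

# Reopen read-only; slices are pulled from disk only when touched.
obs = np.memmap(path, dtype=np.float64, mode="r",
                shape=(n_obs, n_features))
chunk_means = [np.asarray(obs[i:i + 1000]).mean(axis=0)
               for i in range(0, n_obs, 1000)]
```

Slices of such an array can be handed to vq one chunk at a time without
loading the whole file.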

see ya,
eric





