[SciPy-user] 2D clustering question

Damian Eads eads@soe.ucsc....
Tue May 12 19:35:41 CDT 2009


Hi Hazen,

Sorry for getting back to you so late. I was traveling a lot, and I'm
just now catching up on my e-mail.

Without knowing much about the details of your problem, I imagine a
lot of the slowness is caused by the allocation of an exceptionally
large distance matrix, which grows as n^2 in the number of points n.
In some cases there is no need for a full distance matrix
representation, since many points will be too far from each other
(sparsity in the sense that many entries will be large) because they
belong to different clusters. Agglomerative clustering may not be
helpful when the distance between points is too large. One solution
is to first run k-means, and then run agglomerative clustering
individually on each cluster generated by k-means. This is less
expensive because you allocate several smaller distance matrices
instead of one big one. Optionally, as a third step, you can run
agglomerative clustering on the centroids of these clusters to get a
coarse approximation of how the clusters relate to one another.
Please keep in mind this idea may be completely inappropriate for
your data. In other cases, there will be many points close to one
another (sparsity in that many entries are small or close to zero),
in which case it may be worth filtering out near-duplicate points
first.
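
To make that concrete, here is a rough, untested sketch of the
two-stage idea. The function name and the k=50 default are made up
for illustration, and single linkage with a distance cutoff is just
one choice among several:

    import numpy as np
    from scipy.cluster.vq import kmeans2
    from scipy.cluster.hierarchy import linkage, fcluster

    def twostage_cluster(points, k=50, cutoff=0.5):
        pts = np.asarray(points, dtype=float)
        # Stage 1: coarse k-means partition, so each distance matrix
        # in stage 2 is roughly (n/k)^2 entries instead of n^2.
        _, coarse = kmeans2(pts, k, minit='points')
        labels = np.zeros(len(pts), dtype=int)
        next_label = 1
        for c in range(k):
            idx = np.nonzero(coarse == c)[0]
            if len(idx) == 0:
                continue
            if len(idx) == 1:    # linkage needs >= 2 observations
                labels[idx] = next_label
                next_label += 1
                continue
            # Stage 2: single-linkage agglomerative clustering within
            # the cell, cut at the distance threshold. Note the
            # caveat above: a natural cluster that straddles a
            # k-means boundary will be split.
            Z = linkage(pts[idx], method='single')
            sub = fcluster(Z, t=cutoff, criterion='distance')
            labels[idx] = sub + next_label - 1
            next_label += sub.max()
        return labels

    # e.g.  labels = twostage_cluster(xy, k=100, cutoff=0.5)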

Someone suggested a while back adding a feature to the hierarchical
clustering code so that a full distance matrix isn't stored and
distances are computed only when they're needed. I don't know whether
anyone has made any progress on that.

Cheers,

Damian


On Mon, May 4, 2009 at 4:06 PM, Hazen Babcock <hbabcock@mac.com> wrote:
>
> Hello,
>
> I've been using scipy.cluster.hierarchy.fclusterdata() to cluster groups
> of points based on their x and y position. This works well for data sets
> without too many points, but seems to get pretty slow as the number
> of points gets into the high thousands (i.e. 6000+). Does anyone know of
> a more specialized clustering algorithm that might be able to handle
> even larger numbers of points, i.e. up to 10e6 or so? The points are
> spread out over 0 - 200 or so in X and Y and I'm clustering with a 0.5
> cutoff. One approach is to break the data set down into smaller sections
> based on X,Y coordinate, but perhaps something like this already exists?
>
> thanks,
> -Hazen
>
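
P.S. For what it's worth, the grid partition you mention at the end
is easy to prototype. A minimal, untested sketch follows; the names
are made up, and the cell size must be at least as wide as the
cutoff so that every neighbor within 0.5 lies in the same cell or one
of the eight adjacent cells:

    import numpy as np
    from collections import defaultdict

    def grid_bins(points, cell=0.5):
        # Hash each point to a square cell; with cell >= cutoff, any
        # pair within the cutoff falls in the same or an adjacent
        # cell, so clustering can proceed per 3x3 block of cells.
        bins = defaultdict(list)
        for i, (x, y) in enumerate(np.asarray(points, dtype=float)):
            bins[(int(x // cell), int(y // cell))].append(i)
        return bins

You would then cluster each cell together with its eight neighbors
and merge clusters that share points across cells (e.g. with a
union-find structure); that merge step is where the bookkeeping
lives.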

-----------------------------------------------------
Damian Eads                             Ph.D. Candidate
Jack Baskin School of Engineering, UCSC        E2-489
1156 High Street                 Machine Learning Lab
Santa Cruz, CA 95064    http://www.soe.ucsc.edu/~eads

