# [SciPy-user] Scipy for Cluster Detection

David Warde-Farley dwf@cs.toronto....
Thu Dec 27 00:24:26 CST 2007

```On 26-Dec-07, at 7:28 PM, Lorenzo Isella wrote:

> Dear All,
> I hope this will not sound too off-topic.
> Without going into details, I am simulating the behavior of a set of
> particles interacting with a short-ranged, strongly binding
> potential. I
> start with a number N of particles which undergo collisions with each
> other and, in doing so, they stick giving rise to clusters.
> The code which does that is not written in Python, but it returns the
> positions of these particles in space.
> As time goes on, more and more particles coagulate around one of these
> clusters. Eventually, they will all end up in the same cluster.
> Two particles are considered to be bounded if their distance falls
> below
> a certain threshold d.
> A cluster is nothing else than a group of particles all directly or
> indirectly bound together.

This doesn't sound like what would typically be considered a

Do you have code that produces the (N^2 - N)/2 distances between pairs
of particles? How large is N? If it's small enough, turning it into a
NumPy array should be straightforward, and then thresholding at d is
one line of code. Then what you end up with is a boolean matrix that
where the (i,j) entry denotes that i and j are in a cluster together.
This can be thought of as a graph or network where the nodes are
particles and connections / links / edges exist between those
particles that are in a cluster together.

Once you have this your problem is what graph theorists would call
identifying the "connected components", which is a well-studied
problem and can be done fairly easily, Google "finding connected
components" or pick up any algorithms textbook at your local library
such as the Cormen et al "Introduction to Algorithms".

I'm not in a position to comment on whether there's something like
this already in SciPy, as I really don't know. However, finding
connected components of a graph (i.e. your clusters) is a very well
studied problem in the computer science (namely graph theory)
literature, and the algorithms are simple enough that less than 20
lines of Python should do the trick provided the distance matrix is
manageable.

Regards,

David
```