[SciPy-user] Looking for a way to cluster data

Zachary Pincus zachary.pincus@yale....
Mon Apr 27 11:32:43 CDT 2009

Hi Gary,

> Using scipy.ndimage I label the 26-connected regions so that I can
> process each connected branch-like structure in turn. For each of these,
> I form a graph with nodes corresponding to pixels and edges
> corresponding to the Minimum Spanning Tree weighted by euclidean
> distance. I then use NetworkX to traverse this graph, extracting the
> node visitation order as a single list. Whenever this list makes jumps
> of distance > sqrt(2), it is jumping to a different branch, so I
> identify these jumps and subdivide the list into a list of lists, each
> containing the nodes of a single branch. I then plot these using plot3d.

This sounds like a pretty good approach. Unfortunately, as you've  
recognized, any operations that loop over pixels in pure python are  
pretty much too slow to use for large images. I definitely think  
cython might be a good thing to look into.

Another possibility, if you were coding this in c or cython, would be  
to just do a flood-fill directly. (You could do the recursive  
algorithm, or just use a stack. Either way it's simple.) Every time  
you encounter a branch-point, increment a counter and use that new  
value for the "flood-fill" color. Then you have a labeled image where  
each branch has a given value. Conveniently, ndimage has lots of  
useful tools for dealing with such label images. You could also build  
up the connectivity graph during the flood-fill (nodes = branch  
points, edge weights = the number of pixels along each branch), for  
calculating the MST.
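
In pure Python the flood-fill logic might look roughly like the sketch
below (the function name `label_branches` and the 2D test shapes are
mine, not anything from the thread -- a real version would use
3D/26-connectivity and get ported to cython for speed; note also that
with full connectivity, pixels right next to a junction can themselves
look like branch points, so this over-segments slightly near junctions):

```python
import numpy as np
from itertools import product

def label_branches(skel):
    """Stack-based flood fill over a boolean skeleton image.
    Each branch gets its own integer label; a new label is started
    whenever a branch point (a pixel with more than two skeleton
    neighbors) is crossed. Works for any ndim: 8-connectivity in 2D,
    26-connectivity in 3D."""
    offsets = [o for o in product((-1, 0, 1), repeat=skel.ndim) if any(o)]

    def neighbors(idx):
        for off in offsets:
            n = tuple(i + d for i, d in zip(idx, off))
            if all(0 <= i < s for i, s in zip(n, skel.shape)) and skel[n]:
                yield n

    labels = np.zeros(skel.shape, dtype=int)
    next_label = 0
    for start in zip(*np.nonzero(skel)):
        if labels[start]:
            continue
        next_label += 1
        stack = [(start, next_label)]
        while stack:
            idx, lab = stack.pop()
            if labels[idx]:
                continue
            labels[idx] = lab
            nbrs = list(neighbors(idx))
            if len(nbrs) > 2:
                # Branch point: each unvisited outgoing arm starts a
                # fresh label (labels need not end up contiguous).
                for n in nbrs:
                    if not labels[n]:
                        next_label += 1
                        stack.append((n, next_label))
            else:
                for n in nbrs:
                    if not labels[n]:
                        stack.append((n, lab))
    return labels
```

The resulting label image can then be fed straight to the ndimage
tools mentioned above.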

>> It sounds like what you want from the clustering is to get groups of
>> pixels that form straight-ish lines (and could thus be visualized
>> with 3D rods). Is this correct?
> Actually, I probably want curved lines so your later comments and
> suggestions about fitting poly-lines are what I may try next.

I have another idea that might be good: take the x,y,z coords of each  
pixel along each branch, and fit a parametric smoothing spline through  
them with scipy.interpolate.splprep (use a non-zero / non-None  
smoothing parameter 's'). Then to render the curves, just evaluate the  
spline (scipy.interpolate.splev) at however many internal points you  
want, to give a poly-line. Use more points to get better resolution...
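
For example (a minimal sketch -- the noisy helix here just stands in
for the x,y,z pixel coordinates of one branch, and the smoothing value
s=0.5 is an arbitrary choice you'd tune for your data):

```python
import numpy as np
from scipy.interpolate import splprep, splev

# Noisy samples along a 3D helix, standing in for the pixel
# coordinates of a single branch.
t = np.linspace(0, 4 * np.pi, 60)
rng = np.random.default_rng(0)
x = np.cos(t) + rng.normal(0, 0.05, t.size)
y = np.sin(t) + rng.normal(0, 0.05, t.size)
z = t / (4 * np.pi) + rng.normal(0, 0.05, t.size)

# Fit a parametric smoothing spline; s > 0 trades fidelity
# against smoothness.
tck, u = splprep([x, y, z], s=0.5)

# Evaluate at 200 parameter values to get a poly-line for plotting;
# more points give better resolution.
fine = np.linspace(0, 1, 200)
xs, ys, zs = splev(fine, tck)
```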

I do this a lot when I need to "post-filter" binarized image data into  
smooth continuous arcs.

>> Most clustering algorithms are likely
>> to just give you groups of pixels that are nearby spatially -- which
>> is probably not exactly what you want, since if you (say) have a long
>> rod-structure, you'd want all the pixels in that rod grouped
>> together, and not grouped with separate rods that cross nearby in
>> space but aren't physically connected. So if you do want to cluster
>> the pixels, you'll need to use the geodesic distance between two
>> pixels, not the euclidean distance. But that still wouldn't give you
>> sets of rods... more like rods and blobs at the junctions.
> I'm not sure what you mean by geodesic distance here.

The geodesic distance between two pixels is the length of the shortest  
path *along the binarized skeleton* between those pixels, as opposed to  
the shortest straight line between them. This matters, say, if you have  
two branches that are nearby in space but rather far from one another  
in terms of connectivity -- using a euclidean distance to cluster  
pixels might then put nearby pixels from the two different branches in  
the same cluster.
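
A toy illustration of the difference, here using scipy.sparse.csgraph
instead of NetworkX (the U-shaped chain of "pixel" coordinates is made
up for the example):

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

# Six pixels forming a U: the two tips are 1 apart in space but
# 5 unit steps apart along the skeleton.
coords = np.array([[0, 0], [1, 0], [2, 0], [2, 1], [1, 1], [0, 1]])
n = len(coords)

# Adjacency matrix: connect consecutive pixels along the skeleton,
# weighted by the euclidean length of each step. For a dense matrix,
# csgraph treats zero entries as "no edge".
adj = np.zeros((n, n))
for i in range(n - 1):
    d = np.linalg.norm(coords[i + 1] - coords[i])
    adj[i, i + 1] = adj[i + 1, i] = d

geodesic = shortest_path(adj, directed=False)

euclid_tips = np.linalg.norm(coords[0] - coords[-1])  # 1.0: looks close
geo_tips = geodesic[0, -1]                            # 5.0 along the U
```

Clustering on `geodesic` rather than on straight-line distances keeps
the two arms of the U apart.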

>> Having such a connectivity graph is a useful thing -- lots of neat
>> stuff one can calculate from that. In fact, I think you'd need to do
>> this to calculate geodesic distances anyway... Anyhow, you could
>> then try fitting the points from each branch to a poly-line (using
>> some information criterion for determining how many lines to use), to
>> simplify the representation down to rods for plotting.
> I'm planning to do this next. I'm hoping I can think of a very simple
> criterion.

Check out the standard, if maybe hokey, information criteria for model  
selection, like AIC and BIC. They offer semi-principled ways to trade  
off goodness-of-fit vs. model parsimony.
http://en.wikipedia.org/wiki/Akaike_information_criterion and related  
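
As a sketch of how such a criterion could drive the choice (polynomial
degree stands in here for the number of poly-line pieces, and
`bic_gaussian` is my own helper -- BIC for least-squares fits with
Gaussian residuals, not a scipy function):

```python
import numpy as np

def bic_gaussian(rss, n, k):
    """BIC for a least-squares fit with Gaussian residuals:
    n * ln(RSS / n) + k * ln(n), where k counts free parameters."""
    return n * np.log(rss / n) + k * np.log(n)

# Noisy samples of a quadratic; the "right" model order is 2.
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
y = 1 + 2 * x - 3 * x**2 + rng.normal(0, 0.1, x.size)

# Fit polynomials of increasing degree and score each by BIC:
# goodness-of-fit (RSS) traded off against parameter count.
scores = {}
for deg in range(1, 6):
    coef = np.polyfit(x, y, deg)
    rss = np.sum((np.polyval(coef, x) - y) ** 2)
    scores[deg] = bic_gaussian(rss, x.size, deg + 1)

best = min(scores, key=scores.get)
```

The minimum-BIC model is typically at or near the true degree; an
under-fitting degree-1 model scores clearly worse despite its smaller
parameter count.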

