[SciPy-user] Looking for a way to cluster data
Zachary Pincus
zachary.pincus@yale....
Mon Apr 27 11:32:43 CDT 2009
Hi Gary,
> Using scipy.ndimage I label the 26-connected regions so that I can
> process each connected branch-like structure in turn. For each of
> these,
> I form a graph with nodes corresponding to pixels and edges
> corresponding to the Minimum Spanning Tree weighted by euclidean
> distance. I then use NetworkX to traverse this graph, extracting the
> node visitation order as a single list. Whenever this list makes jumps
> of distance >sqrt(2), it is jumping to a different branch, so I
> identify
> these jumps and subdivide the list into a list of lists, each
> containing
> nodes containing only one branch. I then plot these using plot3d.
This sounds like a pretty good approach. Unfortunately, as you've
recognized, any operations that loop over pixels in pure python are
pretty much too slow to use for large images. I definitely think
cython might be a good thing to look into.
Another possibility, if you were coding this in c or cython, would be
to just do a flood-fill directly. (You could do the recursive
algorithm, or just use a stack. Either way it's simple.) Every time
you encounter a branch-point, increment a counter and use that new
value for the "flood-fill" color. Then you have a labeled image where
each branch has a given value. Conveniently, ndimage has lots of
useful tools for dealing with such label images. You could also build
up the connectivity graph during the flood-fill (nodes = branch
points, edge weights = the number of pixels along each branch), for
calculating the MST.
>> It sounds like what you want from the clustering is to get groups of
>> pixels that form straight-ish lines (and could thus be visualized
>> with 3D rods). Is this correct?
>
> Actually, I probably want curved lines so your later comments and
> suggestions about fitting poly-lines are what I may try next.
I have another idea that might be good: take the x,y,z coords of each
pixel along each branch, and fit a parametric smoothing spline through
them with scipy.interpolate.splprep (use a non-zero / non-None
smoothing parameter 's'). Then to render the curves, just evaluate the
spline (scipy.interpolate.splev) at however many internal points you
want, to give a poly-line. Use more points to get better resolution...
I do this a lot when I need to "post-filter" binarized image data into
smooth continuous arcs.
>> Most clustering algorithms are likely
>> to just give you groups of pixels that are nearby spatially -- which
>> is probably not exactly what you want, since if you (say) have a long
>> rod- structure, you'd want all the pixels in that rod grouped
>> together, and not grouped with separate rods that cross nearby in
>> space but aren't physically connected. So if you do want to cluster
>> the pixels, you'll need to use the geodesic distance between two
>> pixels, not the euclidian distance. But that still wouldn't give you
>> sets of rods... more like rods and blobs at the junctions.
>
> I'm not sure what you mean by geodesic distance here.
The geodesic distance between two pixels would be the shortest path
*along the binarzied skeleton* between those pixels. As opposed to the
shortest straight-line between the points. This would be relevant,
say, if you have two branches that are nearby in space, but rather far
from one another in terms of connectivity -- then using a euclidian
distance to cluster pixels might put neaby pixels from the two
different branches in the same cluster.
>> Having such a connectivity graph is a useful thing -- lots of neat
>> stuff one can calculate from that. In fact, I think you'd need to do
>> this to calculate geodesic distances anyway... Anyhow, you could
>> then try fitting the points from each branch to a poly-line (using
>> some information criterion for determining how many lines to use), to
>> simplify the representation down to rods for plotting.
>
> I'm planning to do this next. I'm hoping I can think of a very simple
> criterion.
Check out the standard, if maybe hokey, information criteria for model
selection, like AIC and BIC. They offer semi-principled ways to trade
off goodness-of-fit vs. model parsimony.
http://en.wikipedia.org/wiki/Akaike_information_criterion and related
pages.
Zach
More information about the SciPy-user
mailing list