[Numpy-discussion] Proposal: scipy.spatial
Charles R Harris
Mon Sep 29 23:11:09 CDT 2008
On Mon, Sep 29, 2008 at 9:24 PM, Anne Archibald
> Once again there has been a thread on the numpy/scipy mailing lists
> requesting (essentially) some form of spatial data structure. Pointers
> have been posted to ANN (sadly LGPLed and in C++) as well as a handful
> of pure-python implementations of kd-trees. I suggest the creation of
> a new submodule of scipy, scipy.spatial, to contain spatial data
> structures and algorithms. Specifically, I propose it contain a
> kd-tree implementation providing nearest-neighbor, approximate
> nearest-neighbor, and all-points-near queries. I have a few other
> suggestions for things it might contain, but kd-trees seem like a good
> first step.
> 2008/9/27 Nathan Bell <firstname.lastname@example.org>:
> > On Sat, Sep 27, 2008 at 11:18 PM, Anne Archibald
> > <email@example.com> wrote:
> >> I think a kd-tree implementation would be a valuable addition to
> >> scipy, perhaps in a submodule scipy.spatial that might eventually
> >> contain other spatial data structures and algorithms. What do you
> >> think? Should we have one? Should it be based on Sturla Molden's code,
> >> if the license permits? I am willing to contribute one, if not.
> > +1
> Judging that your vote and mine are enough in the absence of
> dissenting voices, I have written an implementation based on yours,
> Sturla Molden's, and the algorithms described by the authors of the
> ANN library. Before integrating it into scipy, though, I'd like to
> send it around for comments.
> Particular issues:
> * It's pure python right now, but with some effort it could be
> partially or completely cythonized. This is probably a good idea in
> the long run. In the meantime I have crudely vectorized it so that
> users can at least avoid looping in their own code.
> * It is able to return the r nearest neighbors, optionally subject to
> a maximum distance limit, as well as approximate nearest neighbors.
> * It is not currently set up to conveniently return all points within
> some fixed distance of the target point, but this could easily be
> * It returns distances and indices of nearest neighbors in the original
> * The testing code is, frankly, a mess. I need to look into using nose
> in a sensible fashion.
> * The license is the scipy license.
> I am particularly concerned about providing a convenient return
> format. The natural return from a query is a list of neighbors, since
> it may have variable length (there may not be r elements in the tree,
> or you might have supplied a maximum distance which doesn't contain r
> points). For a single query, it's simple to return a python list
> (should it be sorted? currently it's a heap). But if you want to
> vectorize the process, storing the results in an array becomes
> cumbersome. One option is an object array full of lists; another,
> which, I currently use, is an array with nonexistent points marked
> with an infinite distance and an invalid index. A third would be to
> return masked arrays. How do you recommend handling this variable (or
> potentially-variable) sized output?
I think lists are the way to go. They should be pretty fast down at the C
level. I use them myself for collecting the elements of equivalence classes
and such that can be quite dynamic when new elements/relations are added.
> > If you're implementing one, I would highly recommend the
> > "left-balanced" kd-tree.
> Research suggests that at least in high dimension a more geometric
> balancing criterion can produce vastly better results. I used the
> "sliding midpoint" rule, which doesn't allow such a numerical
> balancing but does ensure that you don't have too many long skinny
> cells (since a sphere tends to cut very many of these).
> I've also been thinking about what else would go in scipy.spatial. I
> think it would be valuable to have a reasonably efficient distance
> matrix function (we seem to get that question a lot, and the answer's
> not trivial) as well as a sparse distance matrix function based on the
> kd-trees. The module could potentially also swallow the (currently
> sandboxed?) delaunay code.
It would be good to define the interface separately from the algorithm so
that the latter can be tweaked without affecting the API.
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the Numpy-discussion