[SciPy-Dev] Cover trees for nearest neighbors in general metric space
Thu Mar 8 19:23:57 CST 2012
Following up from the conversation with Emanuele Olivetti and Jake
VanderPlas, I've implemented  a drop-in replacement for
scipy.spatial.kdtree that uses cover trees instead of kd-trees to answer
nearest neighbor queries in a general metric space. To me, this is useful
for finding nearby particles in a 3D periodic box in the context of
molecular simulations, but I'm sure it's more generally useful. It
addresses the same problem that Jake's BallTrees code addresses in
scikit-learn, but I've done my best to reproduce the API of
scipy.spatial.kdtree in order to make this code mostly painless to use. In
particular, kd-tree's useful function for finding all the points in one
tree that are neighbors of every point in another tree (ironically,
"query_ball_tree") is also implemented here for cover trees. I modified
kdtree's extensive unit test to use cover trees, and the code passes it.
The code is informed by what I learned from reading Emanuele's PyCoverTree
(original authors Thomas Kollar and Nil Geisweiller), but at the end of the
day, the final internal representation and algorithm implementations are
quite different. For example, the tree is quickly built in one go at the
beginning using the Batch Construction algorithm in the cover tree paper,
but is then immutable. For running a query, the algorithm is implemented
in a way that's much closer to what is done in a kd-tree, instead of how
it's presented in the paper, which IMHO is somewhat cumbersome.
This is definitely work in progress, and its performance could certainly be
improved. But since it may already be useful to others, I'd like to put it
out there to get some early feedback. Two things that it *doesn't* do yet
are use any vectorized form of a distance function (so every distance
calculation between two points costs one Python function call), and there's
no speedy Cython version yet. I'm hoping to learn enough Cython over the
coming weeks to address both of these shortcomings.
Finally, in reading carefully through kdtree.py, I found a clear bug in the
code, whereby the "eps" parameter for approximate queries doesn't get
forwarded from the externally visible function "query" to the internal
function "__query" that does the work.
All the best,
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the SciPy-Dev