[SciPy-Dev] Cover trees for nearest neighbors in general metric space
Sat Mar 10 14:00:06 CST 2012
On Fri, Mar 9, 2012 at 2:23 AM, Patrick Varilly <email@example.com>wrote:
> Dear all,
> Following up from the conversation with Emanuele Olivetti and Jake
> VanderPlas, I've implemented  a drop-in replacement for
> scipy.spatial.kdtree that uses cover trees instead of kd-trees to answer
> nearest neighbor queries in a general metric space. To me, this is useful
> for finding nearby particles in a 3D periodic box in the context of
> molecular simulations, but I'm sure it's more generally useful. It
> addresses the same problem that Jake's BallTrees code addresses in
> scikit-learn, but I've done my best to reproduce the API of
> scipy.spatial.kdtree in order to make this code mostly painless to use. In
> particular, kd-tree's useful function for finding all the points in one
> tree that are neighbors of every point in another tree (ironically,
> "query_ball_tree") is also implemented here for cover trees. I modified
> kdtree's extensive unit test to use cover trees, and the code passes it.
Out of interest, are you planning to propose this for inclusion in scipy or
scikit-learn once it's done, or keep it as a standalone package?
> The code is informed by what I learned from reading Emanuele's PyCoverTree
> (original authors Thomas Kollar and Nil Geisweiller), but at the end of the
> day, the final internal representation and algorithm implementations are
> quite different. For example, the tree is quickly built in one go at the
> beginning using the Batch Construction algorithm in the cover tree paper,
> but is then immutable. For running a query, the algorithm is implemented
> in a way that's much closer to what is done in a kd-tree, instead of how
> it's presented in the paper, which IMHO is somewhat cumbersome.
> This is definitely work in progress, and its performance could certainly
> be improved. But since it may already be useful to others, I'd like to put
> it out there to get some early feedback. Two things that it *doesn't* do
> yet are use any vectorized form of a distance function (so every distance
> calculation between two points costs one Python function call), and there's
> no speedy Cython version yet.
Jaakko Luttinen proposed just two days ago to expose the distance functions
in scipy/spatial/src/distance.c. That may also be useful for you.
> I'm hoping to learn enough Cython over the coming weeks to address both of
> these shortcomings.
> Finally, in reading carefully through kdtree.py, I found a clear bug in
> the code, whereby the "eps" parameter for approximate queries doesn't get
> forwarded from the externally visible function "query" to the internal
> function "__query" that does the work.
Could you open a ticket for that with a little more detail? Or if you feel
like it, a pull request would be even better:)
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the SciPy-Dev