[SciPy-Dev] Cover trees for nearest neighbors in general metric space

Patrick Varilly patvarilly@gmail....
Sat Mar 10 15:26:19 CST 2012

On Sat, Mar 10, 2012 at 8:00 PM, Ralf Gommers

> On Fri, Mar 9, 2012 at 2:23 AM, Patrick Varilly <patvarilly@gmail.com>wrote:
>> Dear all,
>> Following up from the conversation with Emanuele Olivetti and Jake
>> VanderPlas, I've implemented [0] a drop-in replacement for
>> scipy.spatial.kdtree that uses cover trees instead of kd-trees to answer
>> nearest neighbor queries in a general metric space.  To me, this is useful
>> for finding nearby particles in a 3D periodic box in the context of
>> molecular simulations, but I'm sure it's more generally useful.  It
>> addresses the same problem that Jake's BallTrees code addresses in
>> scikit-learn, but I've done my best to reproduce the API of
>> scipy.spatial.kdtree in order to make this code mostly painless to use.  In
>> particular, kd-tree's useful function for finding all the points in one
>> tree that are neighbors of every point in another tree (ironically,
>> "query_ball_tree") is also implemented here for cover trees.  I modified
>> kdtree's extensive unit test to use cover trees, and the code passes it.
> Out of interest, are you planning to propose this for inclusion in scipy
> or scikit-learn once it's done, or keep it as a standalone package?

Eventually, I'd like to propose it for inclusion in scipy, since this
functionality is not exclusive to machine learning.  I'm using it for
molecular simulations, and didn't even know that nearest-neighbor queries
were useful in machine learning!  I would never have though of looking in
scikit-learn for this.  But I would like to address the two outstanding
issues (vectorized distances and Cython implementation) before proposing it
for inclusion.  On that same vein, I don't know why BallTree is in
scikit-learn and not scipy, for the same reasons.

> This is definitely work in progress, and its performance could certainly
>> be improved.  But since it may already be useful to others, I'd like to put
>> it out there to get some early feedback.  Two things that it *doesn't* do
>> yet are use any vectorized form of a distance function (so every distance
>> calculation between two points costs one Python function call), and there's
>> no speedy Cython version yet.
> Jaakko Luttinen proposed just two days ago to expose the distance
> functions in scipy/spatial/src/distance.c. That may also be useful for you.

Thanks, I'll look into it when I get a chance.

> Finally, in reading carefully through kdtree.py, I found a clear bug in
>> the code, whereby the "eps" parameter for approximate queries doesn't get
>> forwarded from the externally visible function "query" to the internal
>> function "__query" that does the work.
> Could you open a ticket for that with a little more detail? Or if you feel
> like it, a pull request would be even better:)\

I sent in a pull request for it.  It's the first time I do this, so
apologies in advance if I somehow screwed it up.

All the best,

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.scipy.org/pipermail/scipy-dev/attachments/20120310/31f6a55c/attachment.html 

More information about the SciPy-Dev mailing list