[Numpy-discussion] Proposal: scipy.spatial

Anne Archibald peridot.faceted@gmail....
Sun Oct 12 09:17:39 CDT 2008


2008/10/9 David Bolme <bolme1234@comcast.net>:
> I have written up basic nearest neighbor algorithm.  It does a brute
> force search so it will be slower than kdtrees as the number of points
> gets large.  It should however work well for high dimensional data.  I
> have also added the option for user defined distance measures.  The
> user can set a default "p".  "p" has the same functionality if it is a
> float.  "p" can also be a function that computes a distance matrix or
> the measure can be selected using the strings: "Manhattan",
> "Euclidean", or "Correlation".
>
> https://pyvision.svn.sourceforge.net/svnroot/pyvision/trunk/src/pyvision/vector/knn.py

This is interesting. I would point out, though, that if you want a
Minkowski norm, it may be more efficient (that is, faster) to use the
new compiled kd-tree implementation with leafsize set to the size of
your data. This is written in compiled code and uses short-circuit
distance evaluation, and may be much faster for high-dimensional
problems.

Given that, this should perhaps go in with other generic metric space
code. I have a functional implementation of ball trees (though I don't
know how efficient they are), and am looking into implementing cover
trees.

> The interface is similar to Anne's code and in many cases can be used
> as a drop in replacement.  I have posted the code to my own project
> because I have a short term need and I do not have access to the scipy
> repository.  Feel free to include the code with scipy under the scipy
> license.
>
> I did find a typo your documentation.
> typo "trie -> tree" - ... kd-tree is a binary trie, each of whose ...

That's not actually a typo: a trie is a tree in which all the data is
stored at leaf nodes. Basic kd-trees use the nodes themselves to
define splitting planes; you can actually construct one with no extra
storage at all, just by choosing an appropriate order for your array
of points. This implementation chooses splitting planes that may not
pass through any point, so all the points get stored in leaves.

> Also I found the use of k in the documentation some what confusing as
> it is the dimensionality of the data points in the kd-tree and the
> number of neighbors for k-nearest neighbors.

That's a good point. I changed the dimension of the kd-tree to m throughout.

Anne


More information about the Numpy-discussion mailing list