[Numpy-discussion] Proposal: scipy.spatial

David Bolme bolme1234@comcast....
Thu Oct 9 15:28:34 CDT 2008

I have written up basic nearest neighbor algorithm.  It does a brute  
force search so it will be slower than kdtrees as the number of points  
gets large.  It should however work well for high dimensional data.  I  
have also added the option for user defined distance measures.  The  
user can set a default "p".  "p" has the same functionality if it is a  
float.  "p" can also be a function that computes a distance matrix or  
the measure can be selected using the strings: "Manhattan",  
"Euclidean", or "Correlation".


The interface is similar to Anne's code and in many cases can be used  
as a drop in replacement.  I have posted the code to my own project  
because I have a short term need and I do not have access to the scipy  
repository.  Feel free to include the code with scipy under the scipy  

I did find a typo your documentation.
typo "trie -> tree" - ... kd-tree is a binary trie, each of whose ...

Also I found the use of k in the documentation some what confusing as  
it is the dimensionality of the data points in the kd-tree and the  
number of neighbors for k-nearest neighbors.

>> I believe that with mean subtracted and unit length vectors, a
>> Euclidean knn algorithm will produces the same result as if the
>> vectors were compared using correlation.  I am not sure if kd-trees
>> will perform well on the normalized vectors as they have a very
>> specific geometry.  If my math checks out it may be worth adding
>> Pearson's correlation as a default option or as a separate class.
> Actually it's probably easier if the user just does the  
> prenormalization.

I agree.  I think we should keep your class as-is and maybe create a  
class that wraps the kdtree and handles the normalization for  
correlation.  I would also like to look at cover trees, however that  
will have to wait a couple months until I have more time.


More information about the Numpy-discussion mailing list