[Numpy-discussion] Proposal: scipy.spatial
Fri Oct 3 19:25:25 CDT 2008
I remember reading a paper or book that stated that for data that has
been normalized correlation and Euclidean are equivalent and will
produce the same knn results. To this end I spent a couple hours this
afternoon doing the math. This document is the result.
I believe that with mean subtracted and unit length vectors, a
Euclidean knn algorithm will produces the same result as if the
vectors were compared using correlation. I am not sure if kd-trees
will perform well on the normalized vectors as they have a very
specific geometry. If my math checks out it may be worth adding
Pearson's correlation as a default option or as a separate class.
I have also spent a little time looking at kd-trees and the kdtree
code. It looks good. As I understand it kd-trees only work well when
the number of datapoints (N) is larger than 2^D, where D is the
dimensionality of those points. This will not work well for many of
my computer vision problems because often D is large. As Anne
suggested I will probably look at cover trees because often times the
data are "low-dimensional data in high-dimensional spaces". I have
been told though that with a large D there is know known fast
algorithm for knn.
Another problem is that the distances and similarity measures used in
biometrics and computer vision are often very specialized and may or
may not conform to the underlying assumptions of fast algorithms. I
think for this reason I will need an exhaustive search algorithm. I
will code it up modeled after Anne's interface and hopefully it will
make it into the spatial module.
I think that kd-trees and the spatial module are a good contribution
to scipy. I have also enjoyed learning more about norms, correlation,
and fast knn algorithms.
More information about the Numpy-discussion