[Numpy-discussion] Proposal: scipy.spatial
Fri Oct 3 22:03:08 CDT 2008
2008/10/3 David Bolme <firstname.lastname@example.org>:
> I remember reading a paper or book that stated that for data that has
> been normalized correlation and Euclidean are equivalent and will
> produce the same knn results. To this end I spent a couple hours this
> afternoon doing the math. This document is the result.
Yes, you're right, even without the mean subtraction they all lie on a
hypersphere in Euclidean space. It's a little more awkward if you want
to identify x and -x.
> I believe that with mean subtracted and unit length vectors, a
> Euclidean knn algorithm will produces the same result as if the
> vectors were compared using correlation. I am not sure if kd-trees
> will perform well on the normalized vectors as they have a very
> specific geometry. If my math checks out it may be worth adding
> Pearson's correlation as a default option or as a separate class.
Actually it's probably easier if the user just does the prenormalization.
> I have also spent a little time looking at kd-trees and the kdtree
> code. It looks good. As I understand it kd-trees only work well when
> the number of datapoints (N) is larger than 2^D, where D is the
> dimensionality of those points. This will not work well for many of
> my computer vision problems because often D is large. As Anne
> suggested I will probably look at cover trees because often times the
> data are "low-dimensional data in high-dimensional spaces". I have
> been told though that with a large D there is know known fast
> algorithm for knn.
Pretty much true. Though if the intrinsic dimensionality is low, cover
trees should be all right.
> Another problem is that the distances and similarity measures used in
> biometrics and computer vision are often very specialized and may or
> may not conform to the underlying assumptions of fast algorithms. I
> think for this reason I will need an exhaustive search algorithm. I
> will code it up modeled after Anne's interface and hopefully it will
> make it into the spatial module.
Metric spaces are quite general - edit distance for strings, for
example, is an adequate distance measure. But brute-force is
definitely worth having too.
If I get the test suite cleaned up, it should be possible to just drop
an arbitrary k-nearest-neighbors class into it and get a reasonably
thorough collection of unit tests.
More information about the Numpy-discussion