[Numpy-discussion] Proposal: scipy.spatial

Anne Archibald peridot.faceted@gmail....
Fri Oct 3 22:03:08 CDT 2008


2008/10/3 David Bolme <bolme1234@comcast.net>:
> I remember reading a paper or book that stated that for data that has
> been normalized correlation and Euclidean are equivalent and will
> produce the same knn results.  To this end I spent a couple hours this
> afternoon doing the math.  This document is the result.
>
> http://www.cs.colostate.edu/~bolme/bolme08euclidean.pdf

Yes, you're right, even without the mean subtraction they all lie on a
hypersphere in Euclidean space. It's a little more awkward if you want
to identify x and -x.

> I believe that with mean subtracted and unit length vectors, a
> Euclidean knn algorithm will produces the same result as if the
> vectors were compared using correlation.  I am not sure if kd-trees
> will perform well on the normalized vectors as they have a very
> specific geometry.  If my math checks out it may be worth adding
> Pearson's correlation as a default option or as a separate class.

Actually it's probably easier if the user just does the prenormalization.

> I have also spent a little time looking at kd-trees and the kdtree
> code.  It looks good.  As I understand it kd-trees only work well when
> the number of datapoints (N) is larger than 2^D, where D is the
> dimensionality of those points.  This will not work well for many of
> my computer vision problems because often D is large.  As Anne
> suggested I will probably look at cover trees because often times the
> data are "low-dimensional data in high-dimensional spaces".  I have
> been told though that with a large D there is know known fast
> algorithm for knn.

Pretty much true. Though if the intrinsic dimensionality is low, cover
trees should be all right.

> Another problem is that the distances and similarity measures used in
> biometrics and computer vision are often very specialized and may or
> may not conform to the underlying assumptions of fast algorithms.  I
> think for this reason I will need an exhaustive search algorithm.  I
> will code it up modeled after Anne's interface and hopefully it will
> make it into the spatial module.

Metric spaces are quite general - edit distance for strings, for
example, is an adequate distance measure. But brute-force is
definitely worth having too.

If I get the test suite cleaned up, it should be possible to just drop
an arbitrary k-nearest-neighbors class into it and get a reasonably
thorough collection of unit tests.

Anne


More information about the Numpy-discussion mailing list