[SciPy-Dev] Distance Metrics
Tue Jan 10 23:59:15 CST 2012
I've been working on a little project lately centered around distance
metrics ( https://github.com/jakevdp/pyDistances ). The idea was to
create a set of cython distance metrics that can be called as normal
from python with numpy arrays, but which also expose low-level C
function pointers so that the same metrics can be called directly on
memory buffers from within cythonized tree-based KNN searches (KD Tree,
Ball Tree, etc.), without any python overhead.
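To make the idea concrete, here is a minimal pure-Python sketch of the pattern, not the actual pyDistances code. All names (euclidean_kernel, pairwise, nearest_neighbor) are hypothetical. A single low-level kernel works on a flat row-major buffer; in Cython this would presumably be a cdef function reached through a C function pointer, which plain Python can only imitate. The same kernel serves both a distance-matrix wrapper and a brute-force neighbor search standing in for a tree query.

```python
from array import array
from math import sqrt


def euclidean_kernel(buf, i, j, n_features):
    """Euclidean distance between rows i and j of a flat row-major buffer.

    Hypothetical stand-in for a low-level C distance function.
    """
    a = i * n_features
    b = j * n_features
    total = 0.0
    for k in range(n_features):
        d = buf[a + k] - buf[b + k]
        total += d * d
    return sqrt(total)


def pairwise(buf, n_samples, n_features, kernel=euclidean_kernel):
    """Full distance matrix, built only from calls to the kernel."""
    return [[kernel(buf, i, j, n_features) for j in range(n_samples)]
            for i in range(n_samples)]


def nearest_neighbor(buf, n_samples, n_features, query_row,
                     kernel=euclidean_kernel):
    """Brute-force search that calls the kernel directly on the buffer,
    the way a tree-based search would in its inner loop."""
    best, best_d = -1, float("inf")
    for j in range(n_samples):
        if j == query_row:
            continue
        d = kernel(buf, query_row, j, n_features)
        if d < best_d:
            best, best_d = j, d
    return best, best_d


# Three 2-D points stored contiguously: (0, 0), (3, 4), (6, 8)
points = array("d", [0.0, 0.0, 3.0, 4.0, 6.0, 8.0])
D = pairwise(points, 3, 2)
print(D[0])
print(nearest_neighbor(points, 3, 2, 0))
```

Swapping the metric means passing a different kernel; neither the matrix wrapper nor the search loop has to change.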
I initially had in mind developing this for scikit-learn in order to
extend the capability of Ball Tree, but it occurred to me that this
might be nice to have in scipy as well. The speed of computing a
distance matrix is comparable to that of pdist/cdist in
scipy.spatial.distance (a few metrics are slightly faster, a few are
slightly slower). The primary advantage of this approach is that it
exposes the underlying C functions, which can easily be imported and
called from other Cython code.
I think there are several other advantages over the current scipy
implementation. Because the new code is pure cython, it would likely be
easier to maintain and to add metrics than the current scipy setup,
which relies on C routines wrapped by hand using the numpy C-API.
Because all distance functions rely on the same set of underlying cython
routines, there are fewer places for error (for instance, currently the
scipy.spatial.distance boolean routines return different results
depending on whether you call the metrics directly or use cdist/pdist).
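As a rough illustration of the "fewer places for error" point, here is a small pure-Python sketch under my own assumptions; it is not scipy or pyDistances code. The names METRICS, dist, toy_cdist and toy_pdist are hypothetical, and the zero-vector convention in jaccard is simply a choice made for this example. Each boolean metric is defined once, and the direct call and both matrix routines all look up that one definition, so they cannot drift apart.

```python
def hamming(u, v):
    """Fraction of positions where two boolean vectors disagree."""
    return sum(1 for a, b in zip(u, v) if bool(a) != bool(b)) / len(u)


def jaccard(u, v):
    """Jaccard dissimilarity between two boolean vectors.

    Convention chosen for this sketch: two all-False vectors give 0.0.
    """
    either = sum(1 for a, b in zip(u, v) if a or b)
    differ = sum(1 for a, b in zip(u, v) if bool(a) != bool(b))
    return differ / either if either else 0.0


# Single registry: adding a metric means adding one function here.
METRICS = {"hamming": hamming, "jaccard": jaccard}


def dist(u, v, metric):
    """Direct call on one pair of vectors."""
    return METRICS[metric](u, v)


def toy_cdist(XA, XB, metric):
    """Distances between every row of XA and every row of XB."""
    kernel = METRICS[metric]
    return [[kernel(u, v) for v in XB] for u in XA]


def toy_pdist(X, metric):
    """Condensed distances between all distinct pairs of rows of X."""
    kernel = METRICS[metric]
    return [kernel(X[i], X[j])
            for i in range(len(X)) for j in range(i + 1, len(X))]


X = [[True, False, True, False],
     [True, True, False, False],
     [False, False, False, False]]

for name in METRICS:
    # The matrix routine and the direct call agree by construction.
    assert toy_cdist(X, X, name)[0][1] == dist(X[0], X[1], name)
    print(name, toy_pdist(X, name))
```

The same structure carries over to Cython, where the registry would hold typed functions rather than Python callables.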
I'm curious what people think: could a framework like this replace the
current scipy.spatial.distance implementation? Are there any
disadvantages that I'm not noticing?