[SciPy-Dev] scipy.spatial comments
David Warde-Farley
wardefar@iro.umontreal...
Sat Mar 10 19:35:51 CST 2012
On 2012-03-10, at 4:53 AM, Ralf Gommers wrote:
> Second, squared Euclidean distance is computed by taking the square of
> the Euclidean distance. I think it would make more sense to do it the
> other way around: the Euclidean distance is computed by taking the
> square root of the squared Euclidean distance.
>
> Makes sense, should be a little faster.
Actually, the current implementation is absolutely crazy, especially considering that SciPy has easy access to BLAS. One should never be computing Euclidean distances naively like is done in distance.c.
In [36]: def euclidean_distance(X, Y):
....: x2 = (X**2).sum(axis=1)
....: y2 = (Y**2).sum(axis=1)
....: dist = np.dot(X, Y.T)
....: dist *= -2
....: dist += x2[:, np.newaxis]
....: dist += y2
....: return np.sqrt(dist)
....:
In [37]: from scipy.spatial.distance import cdist
In [38]: x = np.random.randn(500, 50)
In [39]: y = np.random.randn(400, 50)
In [40]: timeit euclidean_distance(x, y)
100 loops, best of 3: 7.57 ms per loop
In [41]: timeit cdist(x, y, 'euclidean')
100 loops, best of 3: 16.8 ms per loop
In [42]: np.allclose(euclidean_distance(x, y), cdist(x, y, 'euclidean'))
Out[42]: True
So, a Python implementation that calls NumPy is 2x faster (on my machine with EPD/MKL). A C/Cython implementation that called GEMM directly and didn't iterate over the data multiple times or allocate temporaries would be even faster.
David
More information about the SciPy-Dev
mailing list