[Numpy-discussion] Fastest distance matrix calc
Fri Apr 13 08:49:00 CDT 2007
On 4/13/07, Bill Baxter <firstname.lastname@example.org> wrote:
> I think someone posted some timings about this before but I don't recall.
> The task is to compute the matrix D from two sets of vectors x (M,d)
> and y (N,d).
> The output should be D where D[i,j] is norm(x[i]-y[j])
> The Matlab NetLab toolkit uses something like this to compute it:
> d2 = (x*x).sum(1)[:,numpy.newaxis] + (y*y).sum(1) - 2*mult(x,y.T)
> And then does
> because roundoff in the above can sometimes create negative values. And
> d = sqrt(d2)
> But it just doesn't seem like the best way to do it. Whatever was
> posted before I don't remember anything about a subtract.outer
> solution. Seems like something along the lines of
> might be faster, and also avoid the need to check for negatives.
> I'll do some timings if no one's already done it. Just wanted to check
I'm going to go out on a limb and contend, without running any timings, that
for large M and N, a solution using a for loop will beat either of those.
For example (untested):
results = empty([M, N], float)
# You could be fancy and swap axes depending on which array is larger, but
# I'll leave that for someone else
for i, v in enumerate(x):
results[i] = sqrt(sum((v-y)**2, axis=-1))
Or something like that. The reason that I suspect this will be faster is
that it has better locality, completely finishing a computation on a
relatively small working set before moving onto the next one. The one liners
have to pull the potentially large MxN array into the processor repeatedly.
Or you could use numexpr.
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the Numpy-discussion