[Numpy-discussion] Fastest distance matrix calc

Timothy Hochberg tim.hochberg@ieee....
Fri Apr 13 08:49:00 CDT 2007

On 4/13/07, Bill Baxter <wbaxter@gmail.com> wrote:
> I think someone posted some timings about this before but I don't recall.
> The task is to compute the matrix D from two sets of vectors x (M,d)
> and y (N,d).
> The output should be  D where D[i,j] is norm(x[i]-y[j])
> The Matlab NetLab toolkit uses something like this to compute it:
>     d2 = (x*x).sum(1)[:,numpy.newaxis] + (y*y).sum(1) - 2*mult(x,y.T)
> And then does
>   d2[d2<0]=0
> because roundoff in the above can sometimes create negative values.  And
> finally
>   d = sqrt(d2)
> But it just doesn't seem like the best way to do it.  Whatever was
> posted before I don't remember anything about a subtract.outer
> solution.  Seems like something along the lines of
>      (subtract.outer(x,y)**2).sum(axis=0)
> might be faster, and also avoid the need to check for negatives.
> I'll do some timings if no one's already done it.  Just wanted to check
> first.

I'm going to go out on a limb and contend, without running any timings, that
for large M and N, a solution using a for loop will beat either of those.
For example (untested):

results = empty([M, N], float)
# You could be fancy and swap axes depending on which array is larger, but
# I'll leave that for someone else
for i, v in enumerate(x):
    results[i] = sqrt(sum((v-y)**2, axis=-1))

Or something like that. The reason that I suspect this will be faster is
that it has better locality, completely finishing a computation on a
relatively small working set before moving onto the next one. The one liners
have to pull the potentially large MxN array into the processor repeatedly.

Or you could use numexpr.



-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://projects.scipy.org/pipermail/numpy-discussion/attachments/20070413/88ed0cd9/attachment.html 

More information about the Numpy-discussion mailing list