[Numpy-discussion] Fastest distance matrix calc
Bill Baxter
wbaxter@gmail....
Tue Apr 17 15:24:37 CDT 2007
Oops. Looks like I forgot to attach the test program that generated
that output so you can tell what dist2g actually does.
Funny thing is -- despite being written in C, hypot doesn't actually
win any of the test cases for which it's applicable.
--bb
On 4/17/07, Bill Baxter <wbaxter@gmail.com> wrote:
> Here's a bunch of dist matrix implementations and their timings.
> The upshot is that for most purposes this seems to be the best or at
> least not too far off (basically the cookbook solution Kier posted)
>
> def dist2hd(x,y):
> """Generate a 'coordinate' of the solution at a time"""
> d = npy.zeros((x.shape[0],y.shape[0]),dtype=x.dtype)
> for i in xrange(x.shape[1]):
> diff2 = x[:,i,None] - y[:,i]
> diff2 **= 2
> d += diff2
> npy.sqrt(d,d)
> return d
>
> The only place where it's far from the best is for a small number of
> points (~10) with high dimensionality (~100), which does come up in
> machine learning contexts. For those cases this does much better
> (factor of :
>
> def dist2b3(x,y):
> d = npy.dot(x,y.T)
> d *= -2.0
> d += (x*x).sum(1)[:,None]
> d += (y*y).sum(1)
> # Rounding errors occasionally cause negative entries in d
> d[d<0] = 0
> # in place sqrt
> npy.sqrt(d,d)
> return d
>
> So given that, the obvious solution (if you don't want to delve into
> non-numpy solutions) is to use a hybrid that just switches between
> the two. Not sure what the proper switch is since it seems kind of
> complicated, and probably depends some on cache specifics. But just
> switching based on the dimension of the points seems to be pretty
> effective:
>
> def dist2hy(x,y):
> if x.shape[1]<5:
> d = npy.zeros((x.shape[0],y.shape[0]),dtype=x.dtype)
> for i in xrange(x.shape[1]):
> diff2 = x[:,i,None] - y[:,i]
> diff2 **= 2
> d += diff2
> npy.sqrt(d,d)
> return d
>
> else:
> d = npy.dot(x,y.T)
> d *= -2.0
> d += (x*x).sum(1)[:,None]
> d += (y*y).sum(1)
> # Rounding errors occasionally cause negative entries in d
> d[d<0] = 0
> # in place sqrt
> npy.sqrt(d,d)
> return d
>
> All of this assumes 'C' contiguous data. All bets are off if you have
> non-contiguous or 'F' ordered data. And maybe if x and y have very
> different numbers of points.
>
>
> --bb
>
>
>
> On 4/17/07, Keir Mierle <mierle@gmail.com> wrote:
> > On 4/13/07, Timothy Hochberg <tim.hochberg@ieee.org> wrote:
> > > On 4/13/07, Bill Baxter <wbaxter@gmail.com> wrote:
> > > > I think someone posted some timings about this before but I don't recall.
> >
> > http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/498246
> >
> > [snip]
> > > I'm going to go out on a limb and contend, without running any timings, that
> > > for large M and N, a solution using a for loop will beat either of those.
> > > For example (untested):
> > >
> > > results = empty([M, N], float)
> > > # You could be fancy and swap axes depending on which array is larger, but
> > > # I'll leave that for someone else
> > > for i, v in enumerate(x):
> > > results[i] = sqrt(sum((v-y)**2, axis=-1))
> > > Or something like that. The reason that I suspect this will be faster is
> > > that it has better locality, completely finishing a computation on a
> > > relatively small working set before moving onto the next one. The one liners
> > > have to pull the potentially large MxN array into the processor repeatedly.
> >
> > In my experience, it is indeed the case that the for loop version is
> > faster. The fastest of the three versions offered in the above url is
> > the last:
> >
> > from numpy import mat, zeros, newaxis
> > def calcDistanceMatrixFastEuclidean2(nDimPoints):
> > nDimPoints = array(nDimPoints)
> > n,m = nDimPoints.shape
> > delta = zeros((n,n),'d')
> > for d in xrange(m):
> > data = nDimPoints[:,d]
> > delta += (data - data[:,newaxis])**2
> > return sqrt(delta)
> >
> > This is easily extended to two different nDimPoints matricies.
> >
> > Cheers,
> > Keir
> > _______________________________________________
> > Numpy-discussion mailing list
> > Numpy-discussion@scipy.org
> > http://projects.scipy.org/mailman/listinfo/numpy-discussion
> >
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dist2perf.py
Type: text/x-python
Size: 5358 bytes
Desc: not available
Url : http://projects.scipy.org/pipermail/numpy-discussion/attachments/20070418/f0a7b462/attachment-0001.py
More information about the Numpy-discussion
mailing list