[Numpy-discussion] Distance Matrix speed

Sebastian Beca sebastian.beca at gmail.com
Mon Jun 19 15:04:31 CDT 2006


I just ran Alan's script and I don't get consistent results for 100
repetitions. I boosted it to 1000 and ran it several times. Which one
was faster varied a lot, but the two always came within about +-1.5%
of each other.

When it comes to scaling, for my problem (fuzzy clustering), N is the
size of the dataset, which should span from thousands to millions. C
is the number of clusters, usually less than 10, and K, the number of
features (the dimension I want to sum over), is also usually less than
100. So mainly I'm concerned with scaling across N. I tried C=3, K=4,
N=1000, 2500, 5000, 7500, 10000. Also using 1000 runs, the results
were:
dist_beca: 1.1, 4.5, 16, 28, 37
dist_loehner1: 1.7, 6.5, 22, 35, 47

I also tried scaling across K, with C=3, N=2500, and K=5-50. I
couldn't get any consistent results for small K, but both perform
about equally (within +-2%) for large K (K>15).

I'm not sure how these work in the backend, so I can't argue as to
why one should scale better than the other.
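For reference, there is a third approach (not benchmarked in this thread; the function name below is my own) that avoids the Python-level loop entirely by rewriting the squared distance as ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, so the cross term becomes a single matrix product. Just a sketch:

```python
import numpy as np

def dist_dot(A, B):
    """Euclidean distance matrix via ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b.

    A is (N, K), B is (C, K); returns an (N, C) array of distances.
    """
    # Squared row norms, shaped so they broadcast against each other.
    aa = (A**2).sum(axis=1)[:, np.newaxis]   # shape (N, 1)
    bb = (B**2).sum(axis=1)[np.newaxis, :]   # shape (1, C)
    # The cross term is one matrix product, so BLAS does the heavy lifting.
    sq = aa + bb - 2.0 * np.dot(A, B.T)
    # Rounding can push tiny values slightly below zero; clip before the sqrt.
    return np.sqrt(np.clip(sq, 0.0, None))
```

Whether this beats the loop versions would depend on the BLAS in use and on N, C, and K; it also never materializes an (N, C, K) intermediate, unlike the broadcasting version.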
Regards,

Sebastian.

On 6/19/06, Alan G Isaac <aisaac at american.edu> wrote:
> On Sun, 18 Jun 2006, Tim Hochberg apparently wrote:
>
> > Alan G Isaac wrote:
>
> >> On Sun, 18 Jun 2006, Sebastian Beca apparently wrote:
>
> >>> def dist():
> >>>     d = zeros([N, C], dtype=float)
> >>>     if N < C:
> >>>         for i in range(N):
> >>>             xy = A[i] - B
> >>>             d[i,:] = sqrt(sum(xy**2, axis=1))
> >>>         return d
> >>>     else:
> >>>         for j in range(C):
> >>>             xy = A - B[j]
> >>>             d[:,j] = sqrt(sum(xy**2, axis=1))
> >>>         return d
>
> >> But that is 50% slower than Johannes's version:
>
> >> def dist_loehner1():
> >>     d = A[:, newaxis, :] - B[newaxis, :, :]
> >>     d = sqrt((d**2).sum(axis=2))
> >>     return d
>
> > Are you sure about that? I just ran it through timeit, using Sebastian's
> > array sizes and I get Sebastian's version being 150% faster. This
> > could well be cache size dependent, so may vary from box to box, but I'd
> > expect Sebastian's current version to scale better in general.
>
> No, I'm not sure.
> Script attached bottom.
> Most recent output follows:
> for reasons I have not determined,
> it doesn't match my previous runs ...
> Alan
>
> >>> execfile(r'c:\temp\temp.py')
> dist_beca :       3.042277
> dist_loehner1:    3.170026
>
>
> #################################
> #THE SCRIPT
> import sys
> sys.path.append("c:\\temp")
> import numpy
> from numpy import *
> import timeit
>
>
> K = 10
> C = 2500
> N = 3 # One could switch around C and N now.
> A = numpy.random.random( [N, K] )
> B = numpy.random.random( [C, K] )
>
> # beca
> def dist_beca():
>     d = zeros([N, C], dtype=float)
>     if N < C:
>         for i in range(N):
>             xy = A[i] - B
>             d[i,:] = sqrt(sum(xy**2, axis=1))
>         return d
>     else:
>         for j in range(C):
>             xy = A - B[j]
>             d[:,j] = sqrt(sum(xy**2, axis=1))
>     return d
>
> #loehnert
> def dist_loehner1():
>         # drawback: memory usage temporarily doubled
>         # solution see below
>         d = A[:, newaxis, :] - B[newaxis, :, :]
>         # written as 3 expressions for more clarity
>         d = sqrt((d**2).sum(axis=2))
>         return d
>
>
> if __name__ == "__main__":
>         t1 = timeit.Timer('dist_beca()', 'from temp import dist_beca').timeit(100)
>         t8 = timeit.Timer('dist_loehner1()', 'from temp import dist_loehner1').timeit(100)
>         fmt="%-10s:\t"+"%10.6f"
>         print fmt%('dist_beca', t1)
>         print fmt%('dist_loehner1', t8)
>
>
>
>
> _______________________________________________
> Numpy-discussion mailing list
> Numpy-discussion at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/numpy-discussion
>



