[Numpy-discussion] Distance Matrix speed
Tim Hochberg
tim.hochberg at cox.net
Mon Jun 19 16:39:14 CDT 2006
Tim Hochberg wrote:
>Sebastian Beca wrote:
>
>
>
>>I just ran Alan's script and I don't get consistent results for 100
>>repetitions. I boosted it to 1000, and ran it several times. The
>>faster one varied a lot, but both came in within about a +-1.5% difference.
>>
>>When it comes to scaling, for my problem (fuzzy clustering), N is the
>>size of the dataset, which should span from thousands to millions. C
>>is the number of clusters, usually less than 10, and K the number of
>>features (the dimension I want to sum over), usually less than
>>100. So mainly I'm concerned with scaling across N. I tried C=3, K=4,
>>N=1000, 2500, 5000, 7500, 10000. Also using 1000 runs, the results
>>were:
>>dist_beca: 1.1, 4.5, 16, 28, 37
>>dist_loehner1: 1.7, 6.5, 22, 35, 47
>>
>>I also tried scaling across K, with C=3, N=2500, and K=5-50. I
>>couldn't get consistent results for small K, but both tend to
>>perform equally well (+-2%) for large K (K>15).
>>
>>I'm not sure how these work in the backend, so I can't argue as to
>>why one should scale better than the other.
>>
>>
>>
>>
>The reason I suspect that dist_beca should scale better is that
>dist_loehner1 generates an intermediate array of size NxCxK, while
>dist_beca produces intermediate matrices that are only NxK or CxK. For
>large problems, allocating that extra memory and fetching it into and
>out of the cache can be a bottleneck.
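>
>To make the memory difference concrete, here is a small sketch (the sizes are my own illustrative choice, in the range Sebastian described) comparing the broadcast intermediate that dist_loehner1 materializes against the per-cluster temporary that dist_beca uses:
>
>```python
>import numpy as np
>
># illustrative sizes: N samples, C clusters, K features
>N, C, K = 10000, 10, 100
>A = np.random.random((N, K))
>B = np.random.random((C, K))
>
># dist_loehner1 materializes the full N x C x K difference array at once
>full = A[:, np.newaxis, :] - B[np.newaxis, :, :]
>print(full.nbytes)   # N*C*K*8 = 80_000_000 bytes
>
># dist_beca only ever holds an N x K temporary per cluster iteration
>tmp = A - B[0]
>print(tmp.nbytes)    # N*K*8 = 8_000_000 bytes, a factor of C smaller
>```
>
>The factor-of-C difference in working-set size is what pushes the big intermediate out of cache for large N.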
>
>Here's another version that allocates even less in the way of
>temporaries at the expense of being borderline incomprehensible. It
>still allocates an NxK temporary array, but it allocates it once ahead
>of time and then reuses it for all subsequent calculations. You're welcome
>to use it, but I'm not sure I'd recommend it unless this function is
>really a speed bottleneck, as it could end up being hard to read later (I
>left implementing the N<C case as an exercise for the reader....).
>
>I have another idea that might reduce the memory overhead still further,
>if I get a chance I'll try it out and let you know if it results in a
>further speed up.
>
>-tim
>
>
> def dist2(A, B):
>     d = zeros([N, C], dtype=float)
>     if N < C:
>         raise NotImplementedError
>     else:
>         tmp = empty([N, K], float)
>         tmp0 = tmp[:,0]
>         rangek = range(1, K)
>         for j in range(C):
>             subtract(A, B[j], tmp)
>             tmp *= tmp
>             for k in rangek:
>                 tmp0 += tmp[:,k]
>             sqrt(tmp0, d[:,j])
>         return d
>
>
Speaking of scaling: I tried this with K=25000 (10x greater than
Sebastian's original numbers). Much to my surprise it performed somewhat
worse than Sebastian's dist() with large K. Below is a modified
dist2 that performs about the same (marginally better here) for large K,
as well as a dist3 that performs about 50% better at both K=2500 and
K=25000.
-tim
def dist2(A, B):
    d = empty([N, C], dtype=float)
    if N < C:
        raise NotImplementedError
    else:
        tmp = empty([N, K], float)
        for j in range(C):
            subtract(A, B[j], tmp)
            tmp **= 2
            d[:,j] = sum(tmp, axis=1)
            sqrt(d[:,j], d[:,j])
        return d
def dist3(A, B):
    d = zeros([N, C], dtype=float)
    rangek = range(K)
    if N < C:
        raise NotImplementedError
    else:
        tmp = empty([N], float)
        for j in range(C):
            for k in rangek:
                subtract(A[:,k], B[j,k], tmp)
                tmp **= 2
                d[:,j] += tmp
            sqrt(d[:,j], d[:,j])
        return d
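
For anyone adapting these, here is a quick sanity check (my own addition, not from the thread) that the column-at-a-time dist3 strategy agrees with a straightforward broadcasting reference; the sizes are small illustrative choices:

```python
import numpy as np

N, C, K = 50, 3, 4  # small illustrative sizes
A = np.random.random((N, K))
B = np.random.random((C, K))

def dist_ref(A, B):
    # reference: full N x C x K broadcast, clear but memory-hungry
    return np.sqrt(((A[:, np.newaxis, :] - B[np.newaxis, :, :]) ** 2).sum(axis=2))

def dist3(A, B):
    # loop over the (small) K dimension, reusing one length-N temporary
    d = np.zeros((N, C), dtype=float)
    tmp = np.empty(N, dtype=float)
    for j in range(C):
        for k in range(K):
            np.subtract(A[:, k], B[j, k], tmp)  # write difference into tmp
            tmp **= 2
            d[:, j] += tmp
        np.sqrt(d[:, j], d[:, j])  # in-place sqrt on the column view
    return d

print(np.allclose(dist_ref(A, B), dist3(A, B)))
```

Writing the result of `np.sqrt` directly into the column view `d[:, j]` works because basic slicing returns a view, so no extra array is allocated.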
>
>
>
>>Regards,
>>
>>Sebastian.
>>
>>On 6/19/06, Alan G Isaac <aisaac at american.edu> wrote:
>>
>>
>>
>>
>>>On Sun, 18 Jun 2006, Tim Hochberg apparently wrote:
>>>
>>>
>>>
>>>
>>>
>>>>Alan G Isaac wrote:
>>>>
>>>>
>>>>
>>>>
>>>>>On Sun, 18 Jun 2006, Sebastian Beca apparently wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>def dist():
>>>>>>    d = zeros([N, C], dtype=float)
>>>>>>    if N < C:
>>>>>>        for i in range(N):
>>>>>>            xy = A[i] - B
>>>>>>            d[i,:] = sqrt(sum(xy**2, axis=1))
>>>>>>        return d
>>>>>>    else:
>>>>>>        for j in range(C):
>>>>>>            xy = A - B[j]
>>>>>>            d[:,j] = sqrt(sum(xy**2, axis=1))
>>>>>>        return d
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>But that is 50% slower than Johannes's version:
>>>>>
>>>>>
>>>>>def dist_loehner1():
>>>>> d = A[:, newaxis, :] - B[newaxis, :, :]
>>>>> d = sqrt((d**2).sum(axis=2))
>>>>> return d
>>>>>
>>>>>
>>>>>
>>>>>
>>>>Are you sure about that? I just ran it through timeit, using Sebastian's
>>>>array sizes, and I get Sebastian's version being 150% faster. This
>>>>could well be cache-size dependent, so may vary from box to box, but I'd
>>>>expect Sebastian's current version to scale better in general.
>>>>
>>>>
>>>>
>>>>
>>>No, I'm not sure.
>>>Script attached bottom.
>>>Most recent output follows:
>>>for reasons I have not determined,
>>>it doesn't match my previous runs ...
>>>Alan
>>>
>>>
>>>
>>>
>>>
>>>>>>execfile(r'c:\temp\temp.py')
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>dist_beca : 3.042277
>>>dist_loehner1: 3.170026
>>>
>>>
>>>#################################
>>>#THE SCRIPT
>>>import sys
>>>sys.path.append("c:\\temp")
>>>import numpy
>>>
>>>
>>>from numpy import *
>>
>>
>>>import timeit
>>>
>>>
>>>K = 10
>>>C = 2500
>>>N = 3 # One could switch around C and N now.
>>>A = numpy.random.random( [N, K] )
>>>B = numpy.random.random( [C, K] )
>>>
>>># beca
>>>def dist_beca():
>>> d = zeros([N, C], dtype=float)
>>> if N < C:
>>> for i in range(N):
>>> xy = A[i] - B
>>> d[i,:] = sqrt(sum(xy**2, axis=1))
>>> return d
>>> else:
>>> for j in range(C):
>>> xy = A - B[j]
>>> d[:,j] = sqrt(sum(xy**2, axis=1))
>>> return d
>>>
>>>#loehnert
>>>def dist_loehner1():
>>> # drawback: memory usage temporarily doubled
>>> # solution see below
>>> d = A[:, newaxis, :] - B[newaxis, :, :]
>>> # written as 3 expressions for more clarity
>>> d = sqrt((d**2).sum(axis=2))
>>> return d
>>>
>>>
>>>if __name__ == "__main__":
>>> t1 = timeit.Timer('dist_beca()', 'from temp import dist_beca').timeit(100)
>>> t8 = timeit.Timer('dist_loehner1()', 'from temp import dist_loehner1').timeit(100)
>>> fmt="%-10s:\t"+"%10.6f"
>>> print fmt%('dist_beca', t1)
>>> print fmt%('dist_loehner1', t8)
>>>
>>>
>>>
>>>
>>>_______________________________________________
>>>Numpy-discussion mailing list
>>>Numpy-discussion at lists.sourceforge.net
>>>https://lists.sourceforge.net/lists/listinfo/numpy-discussion