[Numpy-discussion] Faster

Damian Eads eads@soe.ucsc....
Sun May 4 16:49:41 CDT 2008


Hi,

Looks like a fun discussion: it's too bad for me I did not join it 
earlier. My first try at scipy-cluster was completely in Python. Like 
you, I also tried to find the most efficient way to transform the 
distance matrix when joining two clusters. Eventually my data sets 
became big enough that I decided to write these parts in C. I don't 
think my Python joining code was efficient as yours.

I tried out your first test and I am a little confused at the output.

In [107]: goodman.test()
1
Clusters
[['BA'], ['FI'], ['MI', 'TO'], ['NA'], ['RM']]
Distance
[[997 662 255 412 996]
  [662 997 468 268 400]
  [255 468 997 219 869]
  [412 268 219 997 669]
  [996 400 869 669 997]]

2
Clusters
[['BA'], ['FI'], ['MI', 'TO', 'NA'], ['RM']]
Distance
[[998 662 412 996]
  [662 998 268 400]
  [412 268 998 669]
  [996 400 669 998]]

3
Clusters
[['BA'], ['FI', 'MI', 'TO', 'NA'], ['RM']]
Distance
[[999 412 996]
  [412 999 669]
  [996 669 999]]

4
Clusters
[['BA', 'FI', 'MI', 'TO', 'NA'], ['RM']]
Distance
[[1000  669]
  [ 669 1000]]

5
Clusters
[['BA', 'FI', 'MI', 'TO', 'NA', 'RM']]
Distance
[[1001]]

The first step is right, singletons 2 and 5 (starting at 0) should be 
joined since they have a minimum distance of 138. Let's look at their 
corresponding rows in the distance matrix.

In [101]: DM[[2,5],:]
Out[101]:
array([[   877.,    295.,  10000.,    754.,    564.,    138.],
        [   996.,    400.,    138.,    869.,    669.,  10000.]])

These two rows, rows 2 and 5, are all that we need to form the row for 
the newly joined cluster in the distance matrix. If we just take the 
minimum for each column we obtain,

In [102]: q=DM[[2,5],:].min(axis=0)
Out[102]: array([ 877.,  295.,  138.,  754.,  564.,  138.])

so the row for the cluster should be the row above with the 2 and 5'th 
row removed. Roughly, there should be a row in the distance matrix with 
the following values but I don't see one in your output.

In [103]: q[q != 138]
Out[103]: array([ 877.,  295.,  754.,  564.])

Since 295 is the minimum distance between this newly joined cluster and 
any other singleton, it should not be chosen for the second iteration 
since singletons 3 and 4 are closer to another with a distance of 219. 
So after iteration 2, you should get [['BA'], ['FI'], ['MI', 'TO'], 
['NA', 'RM']].

Recall that the distance matrix transformation forms a new distance 
matrix using only values from the previous distance matrix. So, at any 
iteration, the values in the distance matrix should be a subset of the 
values in the original distance matrix, eliminating the distance entries 
of the clusters formed.

If we look at the minimum distances in the original distance matrix in 
rank order, we have 138, 219, 255, 268, 295. Thus, we might expect the 
minimum distances found at each iteration to be these values, and they 
are in this case, but I don't have a mathematical proof that it works in 
general. If I run your distance matrix through hcluster.single, I get 
the following linkage matrix. The third column is the distance between 
the clusters joined, and the first two columns are the indices of the 
clusters joined (non-singletons have an index >= n).

array([[   2.,    5.,  138.,    2.],
        [   3.,    4.,  219.,    2.],
        [   0.,    7.,  255.,    3.],
        [   1.,    8.,  268.,    4.],
        [   6.,    9.,  295.,    6.]])

I've attached the dendrogram, since it is easier to interpret.

In [105]: lbls
Out[105]: ['BA', 'FI', 'MI', 'NA', 'RM', 'TO']
In [106]: hcluster.dendrogram(Z, labels=lbls)

I tried running your second test, and you'll see C might give you a 
better performance speed-up (not surprising). Roughly, what I'm doing in 
C is I'm only storing the upper triangular of the distance matrix. An 
array of double*'s (double **) refers to each row of this triangle. To 
eliminate a row, I simply remove the entry in the double ** array. To 
remove a column, I shift the values over in each non-removed row. I'm 
not sure if this is the best approach but it is certainly more efficient 
than what can be achieved in Python.

In [107]: hcluster.goodman.test2(1000)
n = 1000 took 22.10 seconds
In [108]: n=1000
In [109]: uu=numpy.random.rand(n*(n-1)/2)
In [110]: tic = time.time(); hcluster.single(uu); toc = time.time(); 
print toc-tic
Out[110]:
4.57607889175

Damian

Keith Goodman wrote:
> On Fri, May 2, 2008 at 7:25 PM, Robert Kern <robert.kern@gmail.com> wrote:
>>  Assuming x is contiguous and you can modify x in-place:
>>
>>
>>  In [1]: from numpy import *
>>
>>  In [2]: def dist(x):
>>    ...:    x = x + 1e10 * eye(x.shape[0])
>>    ...:    i, j = where(x == x.min())
>>
>>    ...:    return i[0], j[0]
>>    ...:
>>
>>  In [3]: def symmdist(N):
>>    ...:     x = random.rand(N, N)
>>    ...:     x = x + x.T
>>    ...:     x.flat[::N+1] = 0
>>    ...:     return x
>>    ...:
>>
>>  In [4]: symmdist(5)
>>  Out[4]:
>>  array([[ 0.        ,  0.87508654,  1.11691704,  0.80366071,  0.57966808],
>>        [ 0.87508654,  0.        ,  1.5521685 ,  1.74010886,  0.52156877],
>>        [ 1.11691704,  1.5521685 ,  0.        ,  1.22725396,  1.04101992],
>>        [ 0.80366071,  1.74010886,  1.22725396,  0.        ,  1.94577965],
>>        [ 0.57966808,  0.52156877,  1.04101992,  1.94577965,  0.        ]])
>>
>>  In [5]: def kerndist(x):
>>    ...:     N = x.shape[0]
>>    ...:     x.flat[::N+1] = x.max()
>>    ...:     ij = argmin(x.flat)
>>    ...:     i, j = divmod(ij, N)
>>    ...:     return i, j
>>    ...:
>>
>>  In [10]: x = symmdist(500)
>>
>>  In [15]: %timeit dist(x)
>>  10 loops, best of 3: 19.9 ms per loop
>>
>>  In [16]: %timeit kerndist(x)
>>  100 loops, best of 3: 4.38 ms per loop
> 
> I added
> 
> i, j = divmod(x.argmin(), x.shape[0])
> 
> to
> 
> http://scipy.org/PerformanceTips
-------------- next part --------------
A non-text attachment was scrubbed...
Name: example.png
Type: image/png
Size: 11623 bytes
Desc: not available
Url : http://projects.scipy.org/pipermail/numpy-discussion/attachments/20080504/4f8b7fe9/attachment-0001.png 


More information about the Numpy-discussion mailing list