[Numpy-discussion] Faster
Damian R. Eads
eads@lanl....
Sun May 4 19:59:17 CDT 2008
Hi,
Looks like a fun discussion: it's too bad for me I did not join it
earlier. My first try at scipy-cluster was completely in Python. Like you,
I also tried to find the most efficient way to transform the distance
matrix when joining two clusters. Eventually my data sets became big
enough that I decided to write these parts in C. I don't think my Python
joining code was efficient as yours.
I tried out your first test and I am a little confused at the output.
In [107]: goodman.test()
1
Clusters
[['BA'], ['FI'], ['MI', 'TO'], ['NA'], ['RM']]
Distance
[[997 662 255 412 996]
[662 997 468 268 400]
[255 468 997 219 869]
[412 268 219 997 669]
[996 400 869 669 997]]
2
Clusters
[['BA'], ['FI'], ['MI', 'TO', 'NA'], ['RM']]
Distance
[[998 662 412 996]
[662 998 268 400]
[412 268 998 669]
[996 400 669 998]]
3
Clusters
[['BA'], ['FI', 'MI', 'TO', 'NA'], ['RM']]
Distance
[[999 412 996]
[412 999 669]
[996 669 999]]
4
Clusters
[['BA', 'FI', 'MI', 'TO', 'NA'], ['RM']]
Distance
[[1000 669]
[ 669 1000]]
5
Clusters
[['BA', 'FI', 'MI', 'TO', 'NA', 'RM']]
Distance
[[1001]]
The first step is right, singletons 2 and 5 (starting at 0) should be
joined since they have a minimum distance of 138. Let's look at their
corresponding rows in the distance matrix.
In [101]: DM[[2,5],:]
Out[101]:
array([[ 877., 295., 10000., 754., 564., 138.],
[ 996., 400., 138., 869., 669., 10000.]])
These two rows, rows 2 and 5, are all that we need to form the row for the
newly joined cluster in the distance matrix. If we just take the minimum
for each column we obtain,
In [102]: q=DM[[2,5],:].min(axis=0)
Out[102]: array([ 877., 295., 138., 754., 564., 138.])
so the row for the cluster should be the row above with the 2 and 5'th row
removed. Roughly, there should be a row in the distance matrix with the
following values but I don't see them in your output of the transformed
distance matrix.
In [103]: q[q != 138]
Out[103]: array([ 877., 295., 754., 564.])
Since 295 is the minimum distance between this newly joined cluster and
any other singleton, it should not be chosen for the second iteration
since singletons 3 and 4 are closer to another with a distance of 219. So
after iteration 2, you should get [['BA'], ['FI'], ['MI', 'TO'], ['NA',
'RM']].
Recall that the distance matrix transformation forms a new distance matrix
using only values from the previous distance matrix. So, at any iteration,
the values in the distance matrix should be a subset of the values in the
original distance matrix, eliminating the distance entries of the clusters
formed.
If we look at the minimum distances in the original distance matrix in
rank order, we have 138, 219, 255, 268, 295. Thus, we might expect the
minimum distances found at each iteration to be these values, and they are
in this case, but I don't have a mathematical proof that it works in
general. If I run your distance matrix through hcluster.single, I get the
following linkage matrix. Each row represents a non-singleton cluster. The
first two columns are the indices of the clusters joined (non-singletons
have an index >= n), the third column is their distance, and the fourth,
the number of singletons belonging to the cluster.
array([[ 2., 5., 138., 2.],
[ 3., 4., 219., 2.],
[ 0., 7., 255., 3.],
[ 1., 8., 268., 4.],
[ 6., 9., 295., 6.]])
You may wish to look at the dendrogram since it is easier to interpret,
http://www.soe.ucsc.edu/~eads/goodman-dendro.png
In [105]: lbls
Out[105]: ['BA', 'FI', 'MI', 'NA', 'RM', 'TO']
In [106]: hcluster.dendrogram(Z, labels=lbls)
FWIW, MATLAB returns equivalent output,
>> linkage(dm2, 'single')
ans =
3 6 138
4 5 219
1 8 255
2 9 268
7 10 295
I tried running your second test, and you'll see C might give you a better
performance speed-up, which is not surprising. Roughly, what I'm doing in
C is I'm only storing the upper triangular of the distance matrix. An
array of double*'s (double **) refers to each row of this triangle. To
eliminate a row, I simply remove the entry in the double ** array. To
remove a column, I shift the values over in each row of the triangle. I'm
not sure if this is the best approach but it is certainly more efficient
than vectorized python, in terms of both memory and computation.
In [107]: goodman.test2(1000)
n = 1000 took 22.10 seconds
In [108]: n=1000
In [109]: uu=numpy.random.rand(n*(n-1)/2)
In [110]: tic = time.time(); hcluster.single(uu); toc = time.time(); print
toc-tic
Out[110]:
4.57607889175
Damian
Keith Goodman wrote:
> On Fri, May 2, 2008 at 7:25 PM, Robert Kern <robert.kern@gmail.com> wrote:
>> Assuming x is contiguous and you can modify x in-place:
>>
>>
>> In [1]: from numpy import *
>>
>> In [2]: def dist(x):
>> ...: x = x + 1e10 * eye(x.shape[0])
>> ...: i, j = where(x == x.min())
>>
>> ...: return i[0], j[0]
>> ...:
>>
>> In [3]: def symmdist(N):
>> ...: x = random.rand(N, N)
>> ...: x = x + x.T
>> ...: x.flat[::N+1] = 0
>> ...: return x
>> ...:
>>
>> In [4]: symmdist(5)
>> Out[4]:
>> array([[ 0. , 0.87508654, 1.11691704, 0.80366071, 0.57966808],
>> [ 0.87508654, 0. , 1.5521685 , 1.74010886, 0.52156877],
>> [ 1.11691704, 1.5521685 , 0. , 1.22725396, 1.04101992],
>> [ 0.80366071, 1.74010886, 1.22725396, 0. , 1.94577965],
>> [ 0.57966808, 0.52156877, 1.04101992, 1.94577965, 0. ]])
>>
>> In [5]: def kerndist(x):
>> ...: N = x.shape[0]
>> ...: x.flat[::N+1] = x.max()
>> ...: ij = argmin(x.flat)
>> ...: i, j = divmod(ij, N)
>> ...: return i, j
>> ...:
>>
>> In [10]: x = symmdist(500)
>>
>> In [15]: %timeit dist(x)
>> 10 loops, best of 3: 19.9 ms per loop
>>
>> In [16]: %timeit kerndist(x)
>> 100 loops, best of 3: 4.38 ms per loop
>
> I added
>
> i, j = divmod(x.argmin(), x.shape[0])
>
> to
>
> http://scipy.org/PerformanceTips
More information about the Numpy-discussion
mailing list