[Numpy-discussion] Faster
Damian Eads
eads@soe.ucsc....
Sun May 4 16:49:41 CDT 2008
Hi,
Looks like a fun discussion: it's too bad for me I did not join it
earlier. My first try at scipy-cluster was completely in Python. Like
you, I also tried to find the most efficient way to transform the
distance matrix when joining two clusters. Eventually my data sets
became big enough that I decided to write these parts in C. I don't
think my Python joining code was as efficient as yours.
I tried out your first test and I am a little confused by the output.
In [107]: goodman.test()
1
Clusters
[['BA'], ['FI'], ['MI', 'TO'], ['NA'], ['RM']]
Distance
[[997 662 255 412 996]
[662 997 468 268 400]
[255 468 997 219 869]
[412 268 219 997 669]
[996 400 869 669 997]]
2
Clusters
[['BA'], ['FI'], ['MI', 'TO', 'NA'], ['RM']]
Distance
[[998 662 412 996]
[662 998 268 400]
[412 268 998 669]
[996 400 669 998]]
3
Clusters
[['BA'], ['FI', 'MI', 'TO', 'NA'], ['RM']]
Distance
[[999 412 996]
[412 999 669]
[996 669 999]]
4
Clusters
[['BA', 'FI', 'MI', 'TO', 'NA'], ['RM']]
Distance
[[1000 669]
[ 669 1000]]
5
Clusters
[['BA', 'FI', 'MI', 'TO', 'NA', 'RM']]
Distance
[[1001]]
The first step is right, singletons 2 and 5 (starting at 0) should be
joined since they have a minimum distance of 138. Let's look at their
corresponding rows in the distance matrix.
In [101]: DM[[2,5],:]
Out[101]:
array([[   877.,    295.,  10000.,    754.,    564.,    138.],
       [   996.,    400.,    138.,    869.,    669.,  10000.]])
These two rows, rows 2 and 5, are all that we need to form the row for
the newly joined cluster in the distance matrix. If we just take the
minimum for each column we obtain,
In [102]: q=DM[[2,5],:].min(axis=0)
Out[102]: array([ 877., 295., 138., 754., 564., 138.])
so the row for the new cluster should be the row above with entries 2 and 5
removed. Roughly, there should be a row in the distance matrix with
the following values, but I don't see one in your output.
In [103]: q[q != 138]
Out[103]: array([ 877., 295., 754., 564.])
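The update step described above can be sketched in a few lines of NumPy. This is an illustrative helper (the function name `single_linkage_merge` is mine, not from hcluster): the row for the newly merged cluster is the element-wise minimum of the two joined rows, with the entries for the merged clusters themselves removed.

```python
import numpy as np

def single_linkage_merge(DM, i, j):
    """Return a new square distance matrix after merging clusters i and j
    under single linkage (illustrative sketch, not the hcluster C code)."""
    # element-wise minimum of the two rows being joined
    merged = np.minimum(DM[i], DM[j])
    # drop rows/columns i and j from the old matrix
    keep = [k for k in range(DM.shape[0]) if k not in (i, j)]
    new = DM[np.ix_(keep, keep)]
    # append the merged cluster's row and column; self-distance is 0
    new = np.vstack([new, merged[keep]])
    merged_col = np.append(merged[keep], 0.0)
    new = np.column_stack([new, merged_col])
    return new
```

For a 3x3 matrix where points 0 and 1 are joined, this produces a 2x2 matrix whose off-diagonal entry is min(d(0,2), d(1,2)).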
Since 295 is the minimum distance between this newly joined cluster and
any other singleton, the new cluster should not be chosen for the second
iteration, because singletons 3 and 4 are closer to one another, with a
distance of 219. So after iteration 2, you should get [['BA'], ['FI'],
['MI', 'TO'], ['NA', 'RM']].
Recall that the distance matrix transformation forms a new distance
matrix using only values from the previous distance matrix. So, at any
iteration, the values in the distance matrix should be a subset of the
values in the original distance matrix, eliminating the distance entries
of the clusters formed.
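The subset property is easy to check on a toy matrix: since the single-linkage update only takes element-wise minima of existing rows, every value in the merged row already appears in the original matrix.

```python
import numpy as np

# Toy 3x3 distance matrix; join points 1 and 2 under single linkage.
DM = np.array([[0., 3., 5.],
               [3., 0., 2.],
               [5., 2., 0.]])
merged = np.minimum(DM[1], DM[2])  # element-wise min of the two rows

# every entry of the merged row is a value from the original matrix
assert set(merged.tolist()) <= set(DM.ravel().tolist())
```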
If we look at the minimum distances in the original distance matrix in
rank order, we have 138, 219, 255, 268, 295. Thus, we might expect the
minimum distances found at each iteration to be these values, and they
are in this case, but I don't have a mathematical proof that it works in
general. If I run your distance matrix through hcluster.single, I get
the following linkage matrix. The third column is the distance between
the clusters joined, and the first two columns are the indices of the
clusters joined (non-singletons have an index >= n).
array([[ 2., 5., 138., 2.],
[ 3., 4., 219., 2.],
[ 0., 7., 255., 3.],
[ 1., 8., 268., 4.],
[ 6., 9., 295., 6.]])
I've attached the dendrogram, since it is easier to interpret.
In [105]: lbls
Out[105]: ['BA', 'FI', 'MI', 'NA', 'RM', 'TO']
In [106]: hcluster.dendrogram(Z, labels=lbls)
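For anyone reading this later: the same computation can be done with scipy.cluster.hierarchy, where hcluster's functionality now lives. The full 6x6 matrix below is reconstructed from the rows and minimum distances quoted in this thread (label order ['BA', 'FI', 'MI', 'NA', 'RM', 'TO']), so treat it as illustrative:

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

# Symmetric distance matrix with zero diagonal, label order
# ['BA', 'FI', 'MI', 'NA', 'RM', 'TO']; off-diagonal values taken
# from the rows and minima quoted in this thread.
DM = np.array([
    [  0., 662., 877., 255., 412., 996.],
    [662.,   0., 295., 468., 268., 400.],
    [877., 295.,   0., 754., 564., 138.],
    [255., 468., 754.,   0., 219., 869.],
    [412., 268., 564., 219.,   0., 669.],
    [996., 400., 138., 869., 669.,   0.],
])

# linkage() expects the condensed (upper-triangular) form
Z = linkage(squareform(DM), method='single')
```

The third column of `Z` should reproduce the merge distances above: 138, 219, 255, 268, 295.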
I tried running your second test, and you'll see that C can give you a
better speed-up (not surprising). Roughly, what I'm doing in C is storing
only the upper triangle of the distance matrix. An array of double*'s (a
double **) refers to each row of this triangle. To eliminate a row, I
simply remove the entry in the double ** array. To remove a column, I
shift the values over in each non-removed row. I'm not sure this is the
best approach, but it is certainly more efficient than what can be
achieved in pure Python.
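The storage scheme can be mimicked in Python to see how it works (a sketch only; the actual hcluster internals are C, and the helper names here are mine): the triangle is a list of per-row arrays, so deleting a row is dropping a list entry, and deleting a column is a shift within each remaining row.

```python
import numpy as np

def make_triangle(DM):
    """Store only the strict upper triangle, one array per row
    (the Python analogue of the double** scheme described above)."""
    n = DM.shape[0]
    return [DM[i, i + 1:].copy() for i in range(n - 1)]

def delete_index(rows, k):
    """Remove point k from the triangular representation."""
    if k < len(rows):
        del rows[k]  # drop row k (removing the double** entry in C)
    for i in range(min(k, len(rows))):
        # column k appears in row i at offset k - i - 1; shift values over
        rows[i] = np.delete(rows[i], k - i - 1)
    return rows
```

For a 4x4 matrix, removing point 2 leaves the triangle for points {0, 1, 3}: row 0 keeps d(0,1) and d(0,3), row 1 keeps d(1,3).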
In [107]: hcluster.goodman.test2(1000)
n = 1000 took 22.10 seconds
In [108]: n=1000
In [109]: uu=numpy.random.rand(n*(n-1)/2)
In [110]: tic = time.time(); hcluster.single(uu); toc = time.time(); print toc-tic
4.57607889175
Damian
Keith Goodman wrote:
> On Fri, May 2, 2008 at 7:25 PM, Robert Kern <robert.kern@gmail.com> wrote:
>> Assuming x is contiguous and you can modify x in-place:
>>
>>
>> In [1]: from numpy import *
>>
>> In [2]: def dist(x):
>>    ...:     x = x + 1e10 * eye(x.shape[0])
>>    ...:     i, j = where(x == x.min())
>>    ...:     return i[0], j[0]
>>    ...:
>>
>> In [3]: def symmdist(N):
>>    ...:     x = random.rand(N, N)
>>    ...:     x = x + x.T
>>    ...:     x.flat[::N+1] = 0
>>    ...:     return x
>>    ...:
>>
>> In [4]: symmdist(5)
>> Out[4]:
>> array([[ 0. , 0.87508654, 1.11691704, 0.80366071, 0.57966808],
>> [ 0.87508654, 0. , 1.5521685 , 1.74010886, 0.52156877],
>> [ 1.11691704, 1.5521685 , 0. , 1.22725396, 1.04101992],
>> [ 0.80366071, 1.74010886, 1.22725396, 0. , 1.94577965],
>> [ 0.57966808, 0.52156877, 1.04101992, 1.94577965, 0. ]])
>>
>> In [5]: def kerndist(x):
>>    ...:     N = x.shape[0]
>>    ...:     x.flat[::N+1] = x.max()
>>    ...:     ij = argmin(x.flat)
>>    ...:     i, j = divmod(ij, N)
>>    ...:     return i, j
>>    ...:
>>
>> In [10]: x = symmdist(500)
>>
>> In [15]: %timeit dist(x)
>> 10 loops, best of 3: 19.9 ms per loop
>>
>> In [16]: %timeit kerndist(x)
>> 100 loops, best of 3: 4.38 ms per loop
>
> I added
>
> i, j = divmod(x.argmin(), x.shape[0])
>
> to
>
> http://scipy.org/PerformanceTips
-------------- next part --------------
A non-text attachment was scrubbed...
Name: example.png
Type: image/png
Size: 11623 bytes
Desc: not available
Url : http://projects.scipy.org/pipermail/numpy-discussion/attachments/20080504/4f8b7fe9/attachment-0001.png