Sun May 4 20:53:44 CDT 2008
On Sun, May 4, 2008 at 5:59 PM, Damian R. Eads <email@example.com> wrote:
> Looks like a fun discussion: it's too bad for me I did not join it
> earlier. My first try at scipy-cluster was completely in Python. Like you,
> I also tried to find the most efficient way to transform the distance
> matrix when joining two clusters. Eventually my data sets became big
> enough that I decided to write these parts in C. I don't think my Python
> joining code was efficient as yours.
> I tried out your first test and I am a little confused at the output.
Thank you for catching that bug. It was introduced on the last speed
up. I pasted code from the list that used i as the index. But in the
cluster code I should delete the j-th row and column. After fixing
that the output matches
I should convert that to a unit test.
> You may wish to look at the dendrogram since it is easier to interpret,
> I tried running your second test, and you'll see C might give you a better
> performance speed-up, which is not surprising. Roughly, what I'm doing in
> C is I'm only storing the upper triangular of the distance matrix. An
> array of double*'s (double **) refers to each row of this triangle. To
> eliminate a row, I simply remove the entry in the double ** array. To
> remove a column, I shift the values over in each row of the triangle. I'm
> not sure if this is the best approach but it is certainly more efficient
> than vectorized python, in terms of both memory and computation.
> In : goodman.test2(1000)
> n = 1000 took 22.10 seconds
> In : n=1000
> In : uu=numpy.random.rand(n*(n-1)/2)
> In : tic = time.time(); hcluster.single(uu); toc = time.time(); print
To be (almost) within a factor of 5 is great! But if I get anything
close to useful from my code, I'll move over to your industrial
Here's the bug-fixed version:
import numpy as np
"Single linkage hierarchical clustering"
def __init__(self, dist, label=None, linkage='single', debug=False):
dist Distance matrix, NxN numpy array
label Labels of each row of the distance matrix, list of N items,
default is range(N)
assert dist.shape == dist.shape, 'dist must be square (nxn)'
assert (np.abs(dist - dist.T) < 1e-8).all(), 'dist must be symmetric'
if label is None:
label = range(dist.shape)
assert dist.shape == len(label), 'dist and label must match in size'
msg = 'linkage must be single or complete'
assert linkage in ('single', 'complete'), msg
self.c = [[[z] for z in label]]
self.label = label
self.linkage = linkage
self.dist = dist
self.debug = debug
for level in xrange(len(self.label) - 1):
i, j = self.min_dist()
def join(self, i, j):
assert i != j, 'Cannot combine a cluster with itself'
# Join labels
new = list(self.c[-1])
new[i] = new[i] + new[j]
# Join distance matrix
if self.linkage == 'single':
self.dist[:,i] = self.dist[:,[i,j]].min(1)
elif self.linkage == 'complete':
self.dist[:,i] = self.dist[:,[i,j]].max(1)
raise NotImplementedError, 'Unknown linkage method'
self.dist[i,:] = self.dist[:,i]
# A faster verion of this code...
#idx = range(self.dist.shape)
#self.dist = self.dist[:,idx]
#self.dist = self.dist[idx,:]
# ...is this...
out = self.dist[:-1,:-1]
out[:j,j:] = self.dist[:j,j+1:]
out[j:,:j] = self.dist[j+1:,:j]
out[j:,j:] = self.dist[j+1:,j+1:]
self.dist = out
# Debug output
print len(self.c) - 1
# A faster version of this code...
# dist = self.dist + 1e10 * np.eye(self.dist.shape)
# i, j = np.where(dist == dist.min())
# return i, j
# ...is this:
x = self.dist
N = x.shape
# With complete linkage the min distance was sometimes on the diagonal
# I think it occured on the last merge (one cluster). So I added + 1.
x.flat[::N+1] = x.max() + 1
ij = x.argmin()
i, j = divmod(ij, N)
return i, j
# Example from
label = ['BA', 'FI', 'MI', 'NA', 'RM', 'TO']
dist = np.array([[0, 662, 877, 255, 412, 996],
[662, 0, 295, 468, 268, 400],
[877, 295, 0, 754, 564, 138],
[255, 468, 754, 0, 219, 869],
[412, 268, 564, 219, 0, 669],
[996, 400, 138, 869, 669, 0 ]])
clust = Cluster(dist, label, linkage='single', debug=True)
x = np.random.rand(n,n)
x = x + x.T
c = Cluster(x)
t1 = time.time()
t2 = time.time()
print 'n = %d took %0.2f seconds' % (n, t2-t1)
More information about the Numpy-discussion