[Numpy-discussion] Help using numPy to create a very large multi dimensional array

Charles R Harris charlesr.harris@gmail....
Fri Apr 13 12:36:59 CDT 2007


On 4/13/07, Bruno Santos <bacmsantos@gmail.com> wrote:
>
> Dear Sirs,
> I'm trying to use Numpy to solve a speed problem with Python. I need to
> perform agglomerative clustering as a first step to k-means clustering.
> My problem is that I'm using a very large list in Python and the script is
> taking more than 9 minutes to process all the information, so I'm trying to
> use Numpy to create a matrix.
> I'm reading the vectors from a text file and I end up with an array of
> 115*2634 float elements. How can I create this structure with numpy?
>
> Here is my code in python:
> #Read each document vector to a matrix
>     doclist = []
>     matrix = []
>     list = []
>     for line in vecfile:
>         list = line.split()
>         for elem in range(1, len(list)):
>             list[elem] = float(list[elem])
>         matrix.append (list[1:])
>     vecfile.close()


I don't know what your text file looks like or how many elements are in each
line, but assuming 115 entries/line and spaces, something like the following
will read in the data:

m = N.fromfile('name of text file', sep=' ').reshape(-1,115)

This assumes you have done import numpy as N and will result in a 2634x115
array, which isn't very large.
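
If the first field on each line is not a number (the posted loop skips element 0 of every line, which suggests a document label), fromfile will probably not parse it. A minimal sketch that keeps the posted parsing but builds an array at the end could look like the following. The name load_vectors is hypothetical, and the file layout (one label followed by whitespace-separated floats) is an assumption, not something stated in the post.

```python
import numpy as np

def load_vectors(lines):
    """Build a 2-D float array from lines of the assumed form 'label v1 v2 ...'."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            # Skip blank lines and lines that carry only a label.
            continue
        # Drop the leading label and convert the remaining fields to floats.
        rows.append([float(x) for x in fields[1:]])
    return np.array(rows, dtype=float)
```

Any iterable of lines works, so an open file object can be passed directly, e.g. matrix = load_vectors(vecfile), and matrix.shape then gives (rows, columns).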

>     #Read the desired number of final clusters
>     numclust = input('Input the desired number of clusters: ')
>
> #Clustering process
>     clust = rows
>     ind = [-1, -1]
>     list_j=[]
>     list_k=[]
>     while (clust > numclust):
>         min = 2147483647
>         print('Number of Clusters %d \n' % clust)
>         #Find the 2 most similar vectors in the file
>         for j in range(0, clust):
>             list_j=matrix[j]
>             for k in range(j+1, clust):
>                 list_k=matrix[k]
>                 dist=0
>                 for e in range(0, columns):
>                     result = list_j[e] - list_k[e]
>                     dist += result * result
>                 if (dist < min):
>                     ind[0] = j
>                     ind[1] = k
>                     min = dist

>         #Combine the two most similar vectors by averaging them
>         for e in range(0, columns): matrix[ind[0]][e] = (matrix[ind[0]][e]
> + matrix[ind[1]][e]) / 2.0
>         clust = clust -1
>
>         #Move up all the remaining vectors
>         for k in range(ind[1], (rows - 1)):
>             for e in range(0, columns): matrix[k][e]=matrix[k+1][e]


This is the slow step. Each pass compares every pair of vectors, which is
order N^2, and there are up to N passes, so the whole loop is order N^3 in the
number of vectors. It can be vectorized, but perhaps there is a better
implementation of this algorithm. There may be an agglomerative clustering
algorithm already available in scipy; the documentation indicates that kmeans
clustering software is available. Perhaps someone closer to that library can
help you there.
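
As a rough illustration of what vectorizing could mean here, the sketch below replaces the two inner Python loops (over pairs and over columns) with array operations and keeps the posted merge rule of averaging the two closest vectors. The names closest_pair and agglomerate are hypothetical, it uses import numpy as np rather than N, and it is only one possible approach under those assumptions. The number of passes is unchanged, so it is still order N^3 overall, just with the inner work done inside numpy.

```python
import numpy as np

def closest_pair(m):
    """Return indices (j, k), j < k, of the two rows of m with the smallest
    squared Euclidean distance."""
    # Expand |a - b|^2 = |a|^2 + |b|^2 - 2 a.b for all pairs at once.
    sq = (m * m).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (m @ m.T)
    # Mask the diagonal and lower triangle so only pairs with j < k remain.
    d2[np.tril_indices(len(m))] = np.inf
    j, k = np.unravel_index(np.argmin(d2), d2.shape)
    return int(j), int(k)

def agglomerate(m, numclust):
    """Merge the closest pair repeatedly until numclust rows are left."""
    m = np.array(m, dtype=float)  # work on a copy
    while len(m) > numclust:
        j, k = closest_pair(m)
        # Same merge rule as the posted loop: average the two vectors.
        m[j] = (m[j] + m[k]) / 2.0
        # Drop row k; this replaces the "move up" loop.
        m = np.delete(m, k, axis=0)
    return m
```

Because the distances come from the expanded form, they can differ from the loop version by rounding error, which may change the chosen pair when two distances are nearly equal. For 2634 vectors the full distance matrix holds about 7 million floats, which should fit in memory comfortably.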

Chuck

