# [Numpy-discussion] Help using numPy to create a very large multi dimensional array

Charles R Harris charlesr.harris@gmail....
Fri Apr 13 12:36:59 CDT 2007

```
On 4/13/07, Bruno Santos <bacmsantos@gmail.com> wrote:
>
> Dear Sirs,
> I'm trying to use NumPy to solve a speed problem with Python. I need to
> perform agglomerative clustering as a first step to k-means clustering.
> My problem is that I'm using a very large list in Python and the script is
> taking more than 9 minutes to process all the information, so I'm trying to
> use NumPy to create a matrix.
> I'm reading the vectors from a text file and I end up with an array of
> 115*2634 float elements. How can I create this structure with NumPy?
>
> Here is my code in Python:
> #Read each document vector to a matrix
>     doclist = []
>     matrix = []
>     list = []
>     for line in vecfile:
>         list = line.split()
>         for elem in range(1, len(list)):
>             list[elem] = float(list[elem])
>         matrix.append (list[1:])
>     vecfile.close()

I don't know what your text file looks like or how many elements are in each
line, but assuming 115 entries per line separated by spaces, something like
the following should work:

m = N.fromfile('name of text file', sep=' ').reshape(-1,115)

This assumes you have done import numpy as N and will result in a 2634x115
array, which isn't very large.
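
# A minimal sketch of that load, for illustration only. The tiny sample file,
# its 4-column width and the names below are assumptions, not the poster's
# data; np is the same module the reply imports as N.
import os
import tempfile

import numpy as np

# Write a hypothetical 3-line, 4-column file of space-separated floats.
sample = "1.0 2.0 3.0 4.0\n5.0 6.0 7.0 8.0\n9.0 10.0 11.0 12.0\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(sample)
    path = f.name

# With sep=" ", fromfile treats any run of whitespace (newlines included) as
# a separator, so the result is flat and needs the reshape.
m = np.fromfile(path, sep=" ").reshape(-1, 4)

# loadtxt keeps the line structure, so no reshape is needed. If every line
# began with a non-numeric label (the quoted script drops element 0),
# fromfile could not parse it, whereas loadtxt's usecols argument can skip
# that column.
m2 = np.loadtxt(path)

os.remove(path)
print(m.shape)  # (3, 4)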

> #Read the desired number of final clusters
>     numclust = input('Input the desired number of clusters: ')
>
> #Clustering process
>     clust = rows
>     ind = [-1, -1]
>     list_j=[]
>     list_k=[]
>     while (clust > numclust):
>         min = 2147483647
>         print('Number of Clusters %d \n' % clust)
>         #Find the 2 most similar vectors in the file
>         for j in range(0, clust):
>             list_j=matrix[j]
>             for k in range(j+1, clust):
>                 list_k=matrix[k]
>                 dist=0
>                 for e in range(0, columns):
>                     result = list_j[e] - list_k[e]
>                     dist += result * result
>                 if (dist < min):
>                     ind[0] = j
>                     ind[1] = k
>                     min = dist

>         #Combine the two most similar vectors by median
>         for e in range(0, columns):
>             matrix[ind[0]][e] = (matrix[ind[0]][e] + matrix[ind[1]][e]) / 2.0
>         clust = clust -1
>
>         #Move up all the remaining vectors
>         for k in range(ind[1], (rows - 1)):
>             for e in range(0, columns): matrix[k][e]=matrix[k+1][e]

This clustering loop is the slow step: it repeats the all-pairs search for
every merge, so it is of order N^3 in the number of vectors. It can be
vectorized, but perhaps there is a better implementation of this algorithm.
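
# A minimal sketch of one way to vectorize the pair search, for illustration
# only. The function names and the use of np.delete are assumptions, not the
# poster's code; np is the module the reply imports as N.
import numpy as np


def closest_pair(matrix):
    """Return (j, k), j < k, of the two rows with the smallest squared distance."""
    # |a - b|^2 = |a|^2 + |b|^2 - 2 a.b gives all pairwise squared distances
    # from one matrix product, without an (n, n, columns) temporary.
    sq = (matrix * matrix).sum(axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * np.dot(matrix, matrix.T)
    # Mask the diagonal and lower triangle so only pairs with j < k remain.
    dist[np.tril_indices(len(matrix))] = np.inf
    j, k = np.unravel_index(dist.argmin(), dist.shape)
    return int(j), int(k)


def agglomerate(matrix, numclust):
    """Merge the closest pair of rows until numclust rows are left."""
    matrix = np.array(matrix, dtype=float)  # work on a float copy
    while len(matrix) > numclust:
        j, k = closest_pair(matrix)
        matrix[j] = (matrix[j] + matrix[k]) / 2.0  # average the two vectors
        matrix = np.delete(matrix, k, axis=0)      # replaces the "move up" loop
    return matrix


# Usage on four hypothetical 2-D points forming two obvious groups.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
merged = agglomerate(points, 2)
# This is still cubic overall, since each merge recomputes every distance;
# it only moves the two inner loops into compiled code. Rounding in the
# matrix product can also break near-ties differently from the quoted loops.
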
There may be an agglomerative clustering algorithm already available in
scipy; the documentation indicates that kmeans clustering software is