[Numpy-discussion] Help using numPy to create a very large multi dimensional array
Wed Apr 18 08:59:48 CDT 2007
On 4/18/07, Bruno Santos <email@example.com> wrote:
> Finally I was able to read the data, using the command you suggested with
> some small changes:
> matrix = numpy.array([[float(x) for x in line.split()[1:]] for line in
> But that doesn't solve my speed problem; now, instead of taking 40 seconds
> in the slow step, it takes 1 min and 10 seconds :(
> The slow step is this loop:
> for j in range(0, clust):
>     list_j = numpy.asarray(matrix[j])
>     for k in range(j+1, clust):
>         for e in range(0, columns):
>             result = list_j[e] - list_k[e]
>             dist += result * result
>         if (dist < min):
>             ind = j
>             ind = k
>             min = dist
> I also tried list_j = numpy.array, but that only slowed things down even more.
> Does anyone have any idea how I can speed up this step?
Step 0: think about your algorithm.
Depending on your data, there are probably faster approaches here. One way I
handled a similar problem at one time was to grid up my space into M chunks
and figure out which vector goes where. If your data is chunky, a dict of
lists can work for this. Then you only need to work your clustering magic on
the elements in a given chunk and that chunk's neighbors. This reduces the
problem from N^2 to something like (N/M)^2 per chunk. For more sophistication
you could also look to the fast multipole solver people for inspiration. I
don't know if they do clustering per se, but it seems likely that all their
hierarchical grouping stuff could be adapted for this.
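As a rough illustration of the gridding idea, here is a minimal sketch, not a definitive implementation. The names closest_pair_grid and cell are made up for this example. It assumes low-dimensional points (each cell has 3^D neighbors, so this is not practical for thousands of columns) and it is only exact when the true closest pair is no farther apart than one cell width.

```python
import itertools
from collections import defaultdict

import numpy


def closest_pair_grid(points, cell):
    """Return (squared distance, i, j) for the closest pair found, or None.

    Hypothetical helper: only points in the same or adjacent grid cells are
    compared, so pairs farther apart than `cell` may be missed.
    """
    points = numpy.asarray(points, dtype=float)
    # Bucket point indices by integer cell coordinates: a dict of lists.
    keys = [tuple(k) for k in numpy.floor(points / cell).astype(int).tolist()]
    grid = defaultdict(list)
    for i, key in enumerate(keys):
        grid[key].append(i)

    # Offsets to the cell itself and all of its neighbors.
    offsets = list(itertools.product((-1, 0, 1), repeat=points.shape[1]))
    best = None
    for i, key in enumerate(keys):
        for off in offsets:
            neighbor = tuple(a + b for a, b in zip(key, off))
            for j in grid.get(neighbor, ()):
                if j <= i:
                    continue  # each pair is handled once, with i < j
                diff = points[i] - points[j]
                dist = float(numpy.dot(diff, diff))
                if best is None or dist < best[0]:
                    best = (dist, i, j)
    return best
```

A call might look like closest_pair_grid(points, 0.1) for points scattered over the unit square. If it returns None, no two points shared neighboring cells and a larger cell (or the brute-force loop) is needed.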
Step 1: vectorize your inner loop.
Well, that's complicated, so you may want to try something simple first. It
looks like you could benefit from vectorizing at least your innermost loop
and maybe the innermost 2. The innermost could probably be rewritten:
diff = list_j - list_k
dist = numpy.dot(diff, diff)
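To make the inner-loop rewrite concrete, here is a minimal sketch, assuming list_j and list_k are 1-D float arrays of equal length (the sample values below are made up). The sum of squared differences from the explicit loop is the same as the dot product of the difference vector with itself.

```python
import numpy

# Hypothetical sample rows; in the thread these would be two rows of matrix.
list_j = numpy.array([0.0, 1.0, 2.0, 3.0])
list_k = numpy.array([1.0, 1.0, 0.0, 5.0])

# Explicit element-by-element loop, as in the original post.
dist_loop = 0.0
for e in range(len(list_j)):
    result = list_j[e] - list_k[e]
    dist_loop += result * result

# Vectorized equivalent: one subtraction and one dot product.
diff = list_j - list_k
dist_dot = numpy.dot(diff, diff)
```

Both dist_loop and dist_dot hold the squared Euclidean distance between the two rows.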
Also, all that converting back and forth from matrices is silly. Assuming
you need to convert at all (which you probably don't if you are using dot),
convert just once at the beginning and use the matrix version in the code.
Try that and see if it makes any difference. It would end up something like:
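array = numpy.asarray(matrix)  # convert once, outside the loops
for j in range(0, clust):
    list_j = array[j]
    for k in range(j+1, clust):
        diff = list_j - array[k]
        dist = numpy.dot(diff, diff)  # squared distance between rows j and k
        if (dist < min):
            ind = j
            ind = k
            min = dist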
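Vectorizing the innermost two loops could look something like the sketch below. It is an illustration under assumptions, not the original poster's code: closest_rows is a made-up name, the input is assumed to be a 2-D array with one vector per row, and the result records both row indices as a tuple.

```python
import numpy


def closest_rows(array):
    """Return (squared distance, j, k) for the closest pair of rows, j < k."""
    array = numpy.asarray(array, dtype=float)
    best = (numpy.inf, -1, -1)
    for j in range(len(array) - 1):
        # Differences between row j and every later row, all at once.
        diffs = array[j + 1:] - array[j]
        dists = (diffs * diffs).sum(axis=1)
        offset = int(dists.argmin())
        if dists[offset] < best[0]:
            best = (float(dists[offset]), j, j + 1 + offset)
    return best
```

For data shaped like the file in this thread, the call would be along the lines of closest_rows(matrix), with only the outer loop over j left in Python.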
Hope that's at least marginally useful. There's enough missing from the
original that it's hard to figure out exactly how this works.
2007/4/18, Christian K. <firstname.lastname@example.org>:
> > Bruno Santos wrote:
> > > I tried to use the expression as you said, but I'm not getting the
> > > desired result.
> > > My text file looks like this:
> > >
> > > # num rows=115 num columns=2634
> > > AbassiM.txt 0.033023 0.033023 0.033023 0.165115 0.462321....0.000000
> > > AgricoleW.txt 0.038691 0.038691 0.038691 0.232147 0.541676....0.215300
> > > AliR.txt 0.041885 0.041885 0.041885 0.125656 0.586395....0.633580
> > > .....
> > > ....
> > > ....
> > > ZhangJ.txt 0.047189 0.047189 0.047189 0.155048 0.613452....0.000000
> > I guess N.fromfile can't handle non-numeric data. Use something like
> > this instead (not tested):
> > import numpy as N
> > data = open('name of file').readlines()
> > data = N.array([[float(x) for x in row.split(' ')[1:]] for row in data[1:]])
> > Christian
> > _______________________________________________
> > Numpy-discussion mailing list
> > Numpyemail@example.com
> > http://projects.scipy.org/mailman/listinfo/numpy-discussion