[Numpy-discussion] speeding up an array operation
Frédéric Bastien
nouiz@nouiz....
Fri Jul 10 12:57:11 CDT 2009
Can you do it by chunk instead of by row? If the chunk is not too big, the
sort could be faster than the repeated dictionary accesses. But don't
forget that you are replacing an O(n) algorithm with an O(n log n) one that
has a lower constant, so n (the chunk size) should not be too big. Just try
different values.
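
A minimal sketch of the chunked idea, under some assumptions: the names
iter_chunks, group_chunk and bulk_load are hypothetical, and `store` is a
plain dict of preallocated NumPy arrays standing in for arr[p]['chem'] from
the original loop. Each chunk is sorted on the key column with a stable sort
so that rows sharing a key become contiguous, then each run is written with
one slice assignment instead of one write per row.

```python
import csv
import io
from itertools import islice

import numpy as np


def iter_chunks(reader, chunk_size):
    """Yield lists of at most chunk_size rows from a csv reader."""
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def group_chunk(rows, key_col=1):
    """Sort one chunk by its key column and yield (key, block) per distinct key."""
    data = np.array(rows)                      # 2-D array of strings
    keys = data[:, key_col]
    order = np.argsort(keys, kind="stable")    # stable: keeps file order within a key
    data, keys = data[order], keys[order]
    # A run starts at index 0 and wherever the key differs from the previous row.
    is_start = np.concatenate(([True], keys[1:] != keys[:-1]))
    starts = np.flatnonzero(is_start)
    ends = np.concatenate((starts[1:], [len(keys)]))
    for start, end in zip(starts, ends):
        yield str(keys[start]), data[start:end]


def bulk_load(reader, store, chunk_size=5_000_000):
    """Write each run of same-key rows into store[p] with one slice assignment."""
    offsets = {}  # next free row per key (the role z plays in the original loop)
    for chunk in iter_chunks(reader, chunk_size):
        for key, block in group_chunk(chunk):
            p = "/MIT/" + key
            start = offsets.get(p, 0)
            store[p][start:start + len(block)] = block
            offsets[p] = start + len(block)
    return offsets
```

The chunk size is the n to experiment with: a larger chunk means fewer,
bigger slice assignments but a more expensive sort and more memory per chunk.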
Frédéric Bastien
On Thu, Jul 9, 2009 at 7:14 AM, Mag Gam <magawake@gmail.com> wrote:
> The problem is the array is very large. We are talking about 200+ million
> rows.
> On Thu, Jul 9, 2009 at 4:41 AM, David Warde-Farley<dwf@cs.toronto.edu>
> wrote:
> > On 9-Jul-09, at 1:12 AM, Mag Gam wrote:
> >> Here is what I have, which does it 1x1:
> >>
> >> import csv
> >>
> >> z = {}  # next row index for each key p
> >> r = csv.reader(file)
> >> for i, row in enumerate(r):
> >>     p = "/MIT/" + row[1]
> >>     if p not in z:
> >>         z[p] = 0
> >>     else:
> >>         z[p] += 1
> >>     arr[p]['chem'][z[p]] = tuple(row)  # this loads the array 1 x 1
> >>
> >> I would like to avoid the 1x1 loading; instead I would like to bulk
> >> load the array. Let's say load up 5 million lines into memory and then
> >> push them into the array. Any ideas on how to do that?
> > Depending on how big your data is, this looks like a job for e.g.
> > numpy.loadtxt(), to give you one big array.
> > Then sort the array on the second column, so that all the rows with
> > the same 'p' appear one after the other. Then you can assign slices of
> > this big array to be arr[p]['chem'].
> >
> > David
