[Numpy-discussion] Help to process a large data file
David Huard
david.huard@gmail....
Fri Oct 3 08:48:52 CDT 2008
Frank,

On Thu, Oct 2, 2008 at 3:20 PM, frank wang <f.yw@hotmail.com> wrote:
> Thanks David and Chris for providing the nice solutions.
Glad it helped.
> Both methods work great. I could not tell the speed difference between the
> two solutions. My data size is 1048577 lines.
I'd be curious to know what happens for larger files (~10 M lines). I'd
guess Chris's solution would be the fastest since it works incrementally and
does not load the entire data set into memory. If you ever try, I'd be
interested to know how it turns out.
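
For what it's worth, here is a minimal sketch of what I mean by working
incrementally. It is my own paraphrase of the loop-based approach quoted
below, not Chris's exact code: the function name counts_between_markers is
made up, and I am assuming that a marker row is exactly "1 1" and that only
rows between two markers should be counted, as in your example.

```python
def counts_between_markers(lines):
    """Yield the number of rows containing a 1 between consecutive '1 1' rows.

    `lines` can be any iterable of strings (an open file works), so only one
    row is held in memory at a time.
    """
    count = None  # stays None until the first '1 1' marker has been seen
    for line in lines:
        fields = line.split()
        if fields == ['1', '1']:
            if count is not None:
                yield count  # close the interval opened by the previous marker
            count = 0
        elif count is not None and '1' in fields:
            count += 1


# Frank's example data, with the 'x' annotations left out.
example = """1 0
0 0
1 1
0 0
0 1
0 1
0 0
0 1
1 1
0 0
0 1
0 1
1 1
""".splitlines()

result = list(counts_between_markers(example))
print(result)  # [3, 2]
```

With a real file you would pass the file object itself, e.g.
list(counts_between_markers(open('foo'))). Splitting each line costs a bit
more than a plain string comparison, so I would expect this to be somewhat
slower than the quoted loop; I have not timed it.
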
David
> I did not try the second solution from Chris since it is too slow as Chris
> stated.
> Frank
> > Frank,
> >
> > I would imagine that you cannot get much better performance in Python
> > than this, which avoids string conversions:
> >
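> > c = []      # one count per '1 1' line encountered
> > count = 0
> > for line in open('foo'):
> >     if line == '1 1\n':
> >         # note: c[0] counts the lines before the first '1 1'
> >         c.append(count)
> >         count = 0
> >     else:
> >         if '1' in line: count += 1
> >
> > One could do some numpy trick like: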
> >
> > import numpy as np
> >
> > a = np.loadtxt('foo',dtype=int)
> > a = np.sum(a,axis=1) # Add the two columns horizontally
> > b = np.where(a==2)[0] # Find rows with sum == 2 (1 + 1)
> > count = []
> > for i,j in zip(b[:-1],b[1:]):
> >     count.append( a[i+1:j].sum() ) # Number of lines with a 1 between two '1 1' rows
> >
> > but on my machine the numpy version takes about 20 sec for a 'foo' file
> > of 2,500,000 lines versus 1.2 sec for the pure python version...
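
The Python loop over the marker pairs can be avoided with a cumulative sum.
A rough sketch follows, assuming the two-column array is already loaded (for
instance with np.loadtxt as above). The function name is mine, and I have
not timed it, so I can't say how much of the 20 sec it would recover; a good
part of that may well be spent in loadtxt itself.

```python
import numpy as np


def counts_between_markers_np(a):
    """a: integer array of shape (n, 2) holding only 0s and 1s."""
    s = a.sum(axis=1)                 # 2 on '1 1' rows, 0 or 1 elsewhere
    b = np.flatnonzero(s == 2)        # indices of the '1 1' rows
    ones = (s == 1).astype(np.int64)  # 1 for rows with exactly one 1
    csum = np.cumsum(ones)            # running count of such rows
    # For consecutive markers i < j, csum[j] - csum[i] counts the rows with
    # a 1 strictly between them; marker rows add nothing since ones is 0 there.
    return csum[b[1:]] - csum[b[:-1]]


# Frank's example data, with the 'x' annotations left out.
a = np.array([[1, 0], [0, 0], [1, 1], [0, 0], [0, 1], [0, 1], [0, 0],
              [0, 1], [1, 1], [0, 0], [0, 1], [0, 1], [1, 1]])

counts = counts_between_markers_np(a)
print(counts.tolist())  # [3, 2]
```
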
> >
> > As a side note, if I replace "line == '1 1\n'" with
> > "line.startswith('1 1')", the pure python version goes up to 1.8 sec...
> > Isn't this a bit weird? I'd think startswith() should be faster...
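
My guess, and it is only a guess for your setup, is that the difference is
call overhead: == is dispatched directly as an operator, while startswith
needs an attribute lookup followed by a method call. A small timeit sketch
along these lines would let anyone check on their own machine; the
repetition count is arbitrary and I am not quoting any timings here.

```python
import timeit

line = '1 1\n'
n = 200_000  # arbitrary number of repetitions

# Time the plain equality test used in the loop above.
t_eq = timeit.timeit("line == '1 1\\n'", globals={'line': line}, number=n)

# Time the startswith() variant on the same string.
t_sw = timeit.timeit("line.startswith('1 1')", globals={'line': line}, number=n)

print(f"==          : {t_eq:.4f} s")
print(f"startswith(): {t_sw:.4f} s")
```
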
> > Chris
> >
> > On Wed, Oct 01, 2008 at 07:27:27PM -0600, frank wang wrote:
> >
> > > Hi,
> > >
> > > I have a large data file which contains 2 columns of data. The two
> > > columns only contain zeros and ones. Now I want to count how many ones
> > > there are between the rows where both columns are one. For example, if
> > > my data is:
> > >
> > > 1 0
> > > 0 0
> > > 1 1
> > > 0 0
> > > 0 1 x
> > > 0 1 x
> > > 0 0
> > > 0 1 x
> > > 1 1
> > > 0 0
> > > 0 1 x
> > > 0 1 x
> > > 1 1
> > >
> > > Then my counts will be 3 and 2 (the rows marked with x).
> > >
> > > Is there an efficient way to do this? My data file is pretty big.
> > >
> > > Thanks
> > >
> > > Frank
