# [Numpy-discussion] Help to process a large data file

orionbelt2@gmai...
Thu Oct 2 10:43:37 CDT 2008

```Frank,

I would imagine that you cannot get much better performance in Python
than this, which avoids string conversions:

c = []
count = 0
for line in open('foo'):
    if line == '1 1\n':
        c.append(count)
        count = 0
    else:
        if '1' in line: count += 1

One could do some numpy trick like:

import numpy as np

a = np.loadtxt('foo', dtype=int)  # Load the two columns
a = np.sum(a, axis=1)             # Add the two columns horizontally
b = np.where(a == 2)[0]           # Find rows with sum == 2 (1 + 1)
count = []
for i, j in zip(b[:-1], b[1:]):
    count.append(a[i+1:j].sum())  # Number of ones between consecutive '1 1' rows

but on my machine the numpy version takes about 20 sec for a 'foo' file
of 2,500,000 lines versus 1.2 sec for the pure python version...

As a side note, if I replace "line == '1 1\n'" with
"line.startswith('1 1')", the pure Python version goes up to 1.8 sec...
Isn't that a bit weird? I'd have thought startswith() would be faster...

Chris

On Wed, Oct 01, 2008 at 07:27:27PM -0600, frank wang wrote:

>    Hi,
>
>    I have a large data file which contains 2 columns of data. The two
>    columns contain only zeros and ones. I want to count how many ones
>    occur between lines where both columns are one. For example, if my data is:
>
>    1 0
>    0 0
>    1 1
>    0 0
>    0 1    x
>    0 1    x
>    0 0
>    0 1    x
>    1 1
>    0 0
>    0 1    x
>    0 1    x
>    1 1
>
>    Then my count will be 3 and 2 (the numbers with x).
>
>    Is there an efficient way to do this? My data file is pretty big.
>
>    Thanks
>
>    Frank
```
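Putting the two approaches above together: the sketch below runs both on Frank's sample data, embedded inline via `io.StringIO` (an assumption for illustration; the original code reads a file named `foo`). One caveat worth noting: the pure-Python version also appends the count of ones seen *before* the first `1 1` line, so its list carries one extra leading element compared to the NumPy version.

```python
import io
import numpy as np

# Frank's sample data (two space-separated 0/1 columns)
SAMPLE = """1 0
0 0
1 1
0 0
0 1
0 1
0 0
0 1
1 1
0 0
0 1
0 1
1 1
"""

# Pure-Python version: count lines containing a '1' between '1 1' markers
c = []
count = 0
for line in io.StringIO(SAMPLE):
    if line == '1 1\n':
        c.append(count)
        count = 0
    else:
        if '1' in line:
            count += 1
# c == [1, 3, 2]; the leading 1 counts ones before the first '1 1' line

# NumPy version: row sums of 2 mark the '1 1' lines, then sum between markers
a = np.loadtxt(io.StringIO(SAMPLE), dtype=int)
s = a.sum(axis=1)                 # add the two columns horizontally
b = np.where(s == 2)[0]           # indices of the '1 1' rows
counts = [int(s[i + 1:j].sum()) for i, j in zip(b[:-1], b[1:])]
# counts == [3, 2], matching Frank's expected answer
```

Dropping the pure-Python list's first element (`c[1:]`) makes the two results agree.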