[SciPy-user] handling of huge files for post-processing

Christoph Scheit Christoph.Scheit@lstm.uni-erlangen...
Tue Feb 26 09:27:42 CST 2008

Hello David,

indeed, the data in file a depends on the data in file b...
that is the biggest problem, and consequently
I guess I need something that operates on the
file system rather than in main memory.

Do you think it's possible to use PyTables to
tackle the problem? I would need something
that can group such enormous data sets
together. SQLite is nice for grouping the data
of a table, but I guess my data sets are
just too big...
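PyTables (HDF5) is designed for exactly this kind of out-of-core work: you can append to and read from huge on-disk arrays in chunks, without ever holding the whole data set in memory. For the bare "operate on the file system instead of main memory" idea, NumPy's memmap gives a minimal stdlib-plus-NumPy sketch (the file name, shape, and dtype below are made up for illustration):

```python
import os
import tempfile

import numpy as np

# Hypothetical file; in practice this would be your existing huge data file.
fname = os.path.join(tempfile.mkdtemp(), "huge.dat")

# Create a disk-backed array: the data lives in the file, not in RAM.
m = np.memmap(fname, dtype="float64", mode="w+", shape=(1000, 3))
m[:] = 1.0
m.flush()
del m

# Reopen read-only and reduce it block by block; only one small slice
# of the array is resident in memory at any time.
m2 = np.memmap(fname, dtype="float64", mode="r", shape=(1000, 3))
total = 0.0
for start in range(0, 1000, 100):  # process 100 rows at a time
    total += m2[start:start + 100].sum()

print(total)  # 3000.0
```

PyTables offers the same pattern with richer structure (tables, compression, queries), which is what makes it attractive for grouping large data sets.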

Actually, unfortunately I don't see a way
to iterate over the entries of the files in the
manner you described below...



Message: 3
Date: Tue, 26 Feb 2008 09:17:00 -0500
From: "David Huard" <david.huard@gmail.com>
Subject: Re: [SciPy-user] handling of huge files for post-processing
To: "SciPy Users List" <scipy-user@scipy.org>
Content-Type: text/plain; charset="iso-8859-1"


Do you mean that b depends on the entire dataset a? In this case, you might
consider buying additional memory; this is often far cheaper in terms of
time than trying to optimize the code.

What I mean by iterators is that when you open a binary file, you generally
have the possibility to iterate over each element in the file. For instance,
when reading an ascii file:

for line in f:
    ...  # some operation on the current line

instead of loading the whole file into memory:
lines = f.readlines()

This way, only one line is kept in memory at a time. If you can write your
code in this manner, this might solve your memory problem. For instance,
here is a generator that opens two files and returns the current line of
each file each time its next() method is called:

def read():
    a = open('filea', 'r')
    b = open('fileb', 'r')
    la = a.readline()
    lb = b.readline()
    while la and lb:
        yield la, lb
        la = a.readline()
        lb = b.readline()
    a.close()
    b.close()

for a, b in read():
    ...  # some operation on a, b
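The same lazy pairing can also be had from the built-in zip (itertools.izip on Python 2), since file objects are themselves iterators over lines. A self-contained sketch with throwaway demo files (the file names and contents are made up):

```python
import os
import tempfile

# Create two small demo files standing in for 'filea' and 'fileb'.
tmp = tempfile.mkdtemp()
fa = os.path.join(tmp, "filea")
fb = os.path.join(tmp, "fileb")
with open(fa, "w") as f:
    f.write("a1\na2\na3\n")
with open(fb, "w") as f:
    f.write("b1\nb2\n")

# zip pulls one line at a time from each file, so only the current
# pair of lines is ever in memory; iteration stops at the shorter file.
pairs = []
with open(fa) as a, open(fb) as b:
    for la, lb in zip(a, b):
        pairs.append((la.strip(), lb.strip()))

print(pairs)  # [('a1', 'b1'), ('a2', 'b2')]
```

This avoids writing the readline() bookkeeping by hand, at the cost of silently truncating to the shorter file, so the explicit generator above is preferable when the two files must have the same length.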


