[SciPy-user] handling of huge files for post-processing

Bruce Southey bsouthey@gmail....
Wed Feb 27 09:46:02 CST 2008


Hi,
Christoph, I am unclear about exactly what you are doing: are you
just reading, converting, grouping, and summing across files? (A small
example always helps.) Based on what you have indicated, I doubt that
just switching to PyTables will be sufficient; your emails suggest
that a different scheme is required than the one you are currently
using. It should be expected that large or many files will be resource
intensive - the key is to determine which bottlenecks can be removed.
Depending on what needs to be done, you can process the files one at a
time and accumulate the results, so that you only ever deal with one
file at a time. Alternatively, you can process only specific chunks of
each file, which keeps memory low but requires you to reread the files
multiple times. A minimal sketch of the first approach is below.
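
Something along these lines for the accumulate-per-file approach (the
file pattern 'data_*.txt' and the two-column format are made-up
placeholders, just for illustration):

import glob

totals = {}  # running group sums across all files
for fname in glob.glob('data_*.txt'):   # process files one at a time
    for line in open(fname):            # one line in memory at a time
        key, value = line.split()       # hypothetical: group-id, number
        totals[key] = totals.get(key, 0.0) + float(value)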

Regards,
Bruce





On Tue, Feb 26, 2008 at 11:23 AM, David Huard <david.huard@gmail.com> wrote:
> Whether or not PyTables is going to make a difference really depends on how
> much data you need at a given time to perform the computation. If this
> exceeds your RAM, it doesn't matter what binary format you are using. That
> being said, I am not familiar with sqlite, so I don't know whether there
> are limitations regarding the database size.
>
> Storing your data using PyTables will allow you to store as many GB in a
> single file as you wish. The tricky part will then be to extract only the
> data that you need to perform your computations and to make sure this
> always stays below the RAM limit, or else swap memory will be used and it
> will slow things down considerably.
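>
> For instance, a minimal sketch of chunked reading with PyTables (the
> file name 'data.h5' and the array node '/x' are hypothetical; openFile
> is the PyTables 2.x spelling):
>
> import tables
>
> h5 = tables.openFile('data.h5', 'r')
> x = h5.root.x
> total = 0.0
> chunk = 1000000                   # rows per read; keep one chunk in RAM
> for start in range(0, x.nrows, chunk):
>     total += x[start:start + chunk].sum()   # only this slice is loaded
> h5.close()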
>
> I suggest you try to estimate how much memory you'll be needing for your
> computations, see how much RAM you have, and decide whether or not you
> should just spend some euros and install additional RAM.
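>
> (As a quick back-of-the-envelope example: an array of 100 million
> float64 values takes 100e6 * 8 bytes, roughly 800 MB, so a handful of
> such arrays already exhausts a typical desktop.)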
>
> Servus,
>
> David
>
> 2008/2/26, Christoph Scheit <Christoph.Scheit@lstm.uni-erlangen.de>:
> > Hello David,
> >
> > indeed, data in file a depends on data in file b...
> > that's the biggest problem, and consequently
> > I guess I need something that operates on the
> > file system rather than in main memory.
> >
> > Do you think it's possible to use PyTables to
> > tackle the problem? I would need something
> > that can group together such enormous
> > data sets. sqlite is nice for grouping the data
> > of a table together, but I guess my data sets
> > are just too big...
> >
> > Actually, unfortunately I don't see how
> > to iterate over the entries of the files in the
> > manner you described below...
> >
> > Thanks,
> >
> > Christoph
> > ------------------------------
> >
> > Date: Tue, 26 Feb 2008 09:17:00 -0500
> > From: "David Huard" <david.huard@gmail.com>
> > Subject: Re: [SciPy-user] handling of huge files for post-processing
> > To: "SciPy Users List" <scipy-user@scipy.org>
> >
> >
> > Christoph,
> >
> > Do you mean that b depends on the entire dataset a? In that case, you
> > might consider buying additional memory; this is often far cheaper in
> > terms of time than trying to optimize the code.
> >
> > What I mean by iterators is that when you open a file, you generally
> > have the possibility to iterate over each element in the file. For
> > instance, when reading an ascii file:
> >
> > f = open('somefile', 'r')
> > for line in f:      # the file object yields one line at a time
> >     pass            # some operation on the current line
> >
> > instead of loading the whole file into memory at once:
> >
> > lines = f.readlines()
> >
> > This way, only one line is kept in memory at a time. If you can write your
> > code in this manner, this might solve your memory problem. For instance,
> > here is a generator that opens two files and returns the current line of
> > each file each time its next() method is called:
> >
> > def read():
> >     a = open('filea', 'r')
> >     b = open('fileb', 'r')
> >     la = a.readline()
> >     lb = b.readline()
> >     while la and lb:    # stop when either file runs out of lines
> >         yield la, lb
> >         la = a.readline()
> >         lb = b.readline()
> >
> > for a, b in read():
> >     pass                # some operation on a, b
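> >
> > Equivalently, the standard library can do the same pairing; in
> > Python 2, itertools.izip consumes both files lazily, one line at a
> > time, and stops at the end of the shorter file, just like the
> > generator above:
> >
> > import itertools
> >
> > for a, b in itertools.izip(open('filea', 'r'), open('fileb', 'r')):
> >     pass                # some operation on a, b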
> >
> > HTH,
> >
> > David
> >
> >


More information about the SciPy-user mailing list