[Numpy-discussion] Memory usage of numpy-arrays

Wes McKinney wesmckinn@gmail....
Thu Jul 8 08:52:59 CDT 2010

On Thu, Jul 8, 2010 at 9:26 AM, Hannes Bretschneider
<hannes.bretschneider@wiwi.hu-berlin.de> wrote:
> Dear NumPy developers,
> I have to process some big data files with high-frequency
> financial data. I am trying to load a delimited text file having
> ~700 MB with ~ 10 million lines using numpy.genfromtxt(). The
> machine is a Debian Lenny server 32bit with 3GB of memory.  Since
> the file is just 700MB I am naively assuming that it should fit
> into memory in whole. However, when I attempt to load it, python
> fills the entire available memory and then fails with
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
>  File "/usr/local/lib/python2.6/site-packages/numpy/lib/io.py", line 1318, in genfromtxt
>    errmsg = "\n".join(errmsg)
> MemoryError
> Is there a way to load this file without crashing?
> Thanks, Hannes

From my experience, I would suggest using PyTables (HDF5) as
intermediate storage for the data. The table can be populated
iteratively, though you'll have to parse the data yourself, and marking
missing data could be a problem. This of course requires that you know
the column schema ahead of time, which is one thing np.genfromtxt
handles automatically. If you have a large static data set this can be
particularly worthwhile, since reading the data back out of HDF5 will
be many times faster than re-parsing the text file.
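
As a rough illustration of "parsing the data yourself" against a known
schema, here is a minimal sketch. The column names, the types, and the
choice of NaN and -1 as missing-value markers are all assumptions made
for the example; nothing about the actual file layout is known from the
original message.

```python
import math

# Hypothetical schema: (column name, converter, value used when the field is empty).
# The real file's columns are not known; these are placeholders.
SCHEMA = [
    ("timestamp", float, math.nan),
    ("price", float, math.nan),
    ("volume", int, -1),
]


def parse_line(line, delimiter=","):
    """Convert one delimited text line into a tuple typed according to SCHEMA."""
    fields = line.rstrip("\n").split(delimiter)
    if len(fields) != len(SCHEMA):
        raise ValueError(
            "expected %d fields, got %d" % (len(SCHEMA), len(fields))
        )
    row = []
    for text, (_name, convert, missing) in zip(fields, SCHEMA):
        text = text.strip()
        # An empty field is treated as missing and replaced by the marker.
        row.append(convert(text) if text else missing)
    return tuple(row)
```

Integer columns have no NaN, so a sentinel such as -1 only works if it
cannot occur as real data; that is one way the missing-data problem
shows up in practice.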

I believe you can also append rows to the PyTables Table structure in
chunks, which would be faster than appending one row at a time.
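
A sketch of what chunked appending might look like, assuming a recent
PyTables release (open_file / create_table naming; older releases used
different spellings) and an iterable of already-parsed row tuples. The
table name "ticks", the three columns, and the chunk size are
hypothetical and would need to match the real data.

```python
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield successive lists of at most chunk_size items from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def append_in_chunks(rows, h5_path, chunk_size=100000):
    """Write row tuples to an HDF5 table, one chunk per Table.append call."""
    # Imported here so the chunking helper above works without PyTables.
    import tables

    # Hypothetical column layout; pos fixes the column order so that
    # each tuple is interpreted as (timestamp, price, volume).
    description = {
        "timestamp": tables.Float64Col(pos=0),
        "price": tables.Float64Col(pos=1),
        "volume": tables.Int64Col(pos=2),
    }
    with tables.open_file(h5_path, mode="w") as h5:
        table = h5.create_table("/", "ticks", description)
        for chunk in iter_chunks(rows, chunk_size):
            table.append(chunk)  # accepts a list of tuples
        table.flush()
```

Since rows can be a generator over the lines of the text file, only one
chunk needs to be held in memory at a time, so peak memory depends on
chunk_size rather than on the size of the file.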

