[Numpy-discussion] Memory usage of numpy-arrays
Thu Jul 8 09:46:17 CDT 2010
On 07/08/2010 08:52 AM, Wes McKinney wrote:
> On Thu, Jul 8, 2010 at 9:26 AM, Hannes Bretschneider
> <firstname.lastname@example.org> wrote:
>> Dear NumPy developers,
>> I have to process some big data files with high-frequency
>> financial data. I am trying to load a delimited text file having
>> ~700 MB with ~ 10 million lines using numpy.genfromtxt(). The
>> machine is a Debian Lenny server 32bit with 3GB of memory. Since
>> the file is just 700MB I am naively assuming that it should fit
>> into memory in whole. However, when I attempt to load it, python
>> fills the entire available memory and then fails with
>> Traceback (most recent call last):
>> File "<stdin>", line 1, in <module>
>> File "/usr/local/lib/python2.6/site-packages/numpy/lib/io.py", line 1318, in genfromtxt
>> errmsg = "\n".join(errmsg)
>> Is there a way to load this file without crashing?
>> Thanks, Hannes
>> NumPy-Discussion mailing list
> From my experience I might suggest using PyTables (HDF5) as
> intermediate storage for the data which can be populated iteratively
> (you'll have to parse the data yourself, marking missing data could be
> a problem). This of course requires that you know the column schema
> ahead of time which is one thing that np.genfromtxt will handle
> automatically. Particularly if you have a large static data set this
> can be worthwhile as reading the data out of HDF5 will be many times
> faster than parsing the text file.
> I believe you can also append rows to the PyTables Table structure in
> chunks which would be faster than appending one row at a time.
There have been past discussions on this. NumPy needs contiguous memory,
so you are running out of memory because holding the original data while
the NumPy array is being built exhausts your available contiguous memory.
Note that a file of ~700 MB does not translate into ~700 MB of memory,
since the size of the array depends on the dtypes. Also, a 32-bit system
with 3GB of memory probably has only about 1.5GB of free memory available
(you might get closer to 2GB if you have a very lean system).
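
To make the dtype point concrete, here is a rough sketch of the
arithmetic. The row count comes from the original post; the two column
layouts are hypothetical, since the real schema was not given in this
thread.

```python
import numpy as np

N_ROWS = 10 * 1000 * 1000  # ~10 million lines, as in the original post

# Hypothetical column layouts; the actual schema is not known.
wide = np.dtype([("time", np.float64),
                 ("price", np.float64),
                 ("volume", np.float64)])
compact = np.dtype([("time", np.float64),
                    ("price", np.float32),
                    ("volume", np.int32)])


def array_megabytes(dtype, n_rows=N_ROWS):
    """Size in MB of one contiguous array of n_rows records of this dtype."""
    return dtype.itemsize * n_rows / (1024.0 * 1024.0)


wide_mb = array_megabytes(wide)
compact_mb = array_megabytes(compact)
print("float64 everywhere: %.0f MB" % wide_mb)
print("compact dtypes:     %.0f MB" % compact_mb)
```

This only counts the final array. Whatever the parser keeps in memory
while reading the text comes on top of that.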
If you know your data then you have to do all the hard work yourself to
minimize memory usage, or use something like HDF5 or PyTables.
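
As a minimal sketch of what "doing the hard work yourself" could look
like: count the rows first, preallocate one array with an explicit
dtype, then fill it line by line so that only one line of text is held
at a time. The schema, the comma delimiter, and the helper name
load_delimited are assumptions for illustration, not something from the
original post.

```python
import io

import numpy as np

# Hypothetical schema; adjust to the real columns of the file.
DTYPE = np.dtype([("time", np.float64),
                  ("price", np.float32),
                  ("volume", np.int32)])


def load_delimited(open_file, dtype=DTYPE, delimiter=","):
    """Two-pass load: count non-blank rows, then fill a preallocated array."""
    n_rows = sum(1 for line in open_file if line.strip())
    open_file.seek(0)  # rewind for the second pass

    out = np.empty(n_rows, dtype=dtype)
    i = 0
    for line in open_file:
        if not line.strip():
            continue  # skip blank lines, matching the count above
        t, p, v = line.split(delimiter)
        out[i] = (float(t), float(p), int(v))
        i += 1
    return out


# Small in-memory stand-in for the real file.
sample = io.StringIO(u"1.0,10.5,100\n2.0,10.75,200\n3.0,11.0,50\n")
data = load_delimited(sample)
print(data["price"].mean())
```

For a real file you would pass an object from open() instead of the
StringIO. A pure-Python loop like this is slow for 10 million lines and
does no missing-value handling, but its peak memory stays close to the
size of the final array.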