[Numpy-discussion] Efficient way to load a 1Gb file?
Tue Aug 23 11:07:08 CDT 2011
On 11.08.2011, at 8:50PM, Russell E. Owen wrote:
> It seems a shame that loadtxt has no argument for predicted length,
> which would allow preallocation and less appending/copying data.
> And yes...reading the whole file first to figure out how many elements
> it has seems sensible to me -- at least as a switchable behavior, and
> preferably the default. 1Gb isn't that large in modern systems, but
> loadtxt is filling up all 6Gb of RAM reading it!
1 GB is indeed not much in terms of disk space these days, but using text
files for such data amounts is nonetheless very much non-state-of-the-art ;-)
That said, of course there is no justification to use excessive amounts of
memory where it could be avoided!
Implementing the above scheme for npyio is not quite as straightforward
as in the example I gave before, mainly for the following reasons:
* loadtxt also has to deal with more complex data like structured arrays,
  plus comments, empty lines etc., meaning it has to find and count the
  actual valid data lines.
* Ideally, genfromtxt, which offers yet more functionality to deal with missing
  data, should offer the same options, but they would certainly be more
  difficult to implement there.
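
To make the two-pass idea concrete, here is a minimal sketch of counting the valid data lines first and then filling a preallocated array. It is not the code from the pull request; the helper names count_data_lines and load_prescan are made up for illustration, and it assumes plain whitespace-separated floats with '#' comments (no structured dtypes, converters or missing values).

```python
import io
import numpy as np

def count_data_lines(fh, comments="#"):
    """First pass: count lines that still hold data once comments are stripped."""
    n = 0
    for line in fh:
        if line.split(comments, 1)[0].strip():
            n += 1
    return n

def load_prescan(fh, comments="#", dtype=np.float64):
    """Second pass: fill an array whose row count was fixed by the first pass."""
    nrows = count_data_lines(fh)
    fh.seek(0)  # rewind; this requires a seekable file object
    out = None
    i = 0
    for line in fh:
        fields = line.split(comments, 1)[0].split()
        if not fields:
            continue  # skip empty and comment-only lines
        if out is None:
            # Column count is taken from the first data line (an assumption).
            out = np.empty((nrows, len(fields)), dtype=dtype)
        out[i] = [float(f) for f in fields]
        i += 1
    if out is None:
        return np.empty((0, 0), dtype=dtype)
    return out

sample = io.StringIO("# header\n1.0 2.0\n\n3.0 4.0  # trailing comment\n")
data = load_prescan(sample)
print(data.shape)
```

The only extra cost is reading the file a second time; no intermediate list of rows is ever built.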
More than 6 GB is still remarkable - from what info I found on the web, lists
seem to consume ~24 bytes/element, i.e. 3 times more than a final float64
array. The text representation would typically take 10-20 characters for one
float (though with <12 digits, they could usually be read as float32 without
loss of precision). Thus a factor >6 seems quite extreme, unless the file
is full of (relatively) short integers...
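
As a rough check of that arithmetic, the sketch below plugs in the figures quoted above (~24 bytes per list element, 10-20 characters per value) for a 1 GB file. The 24-byte figure is just the estimate from this mail, not a measurement, and the function name estimate is made up.

```python
import numpy as np

FILE_BYTES = 10**9                    # the 1 GB text file under discussion
LIST_BYTES_PER_ELEMENT = 24           # rough per-element list cost quoted above
ARRAY_BYTES_PER_ELEMENT = np.dtype(np.float64).itemsize  # 8 bytes

def estimate(chars_per_value):
    """Return (number of values, list bytes, float64 array bytes)."""
    n = FILE_BYTES // chars_per_value
    return n, n * LIST_BYTES_PER_ELEMENT, n * ARRAY_BYTES_PER_ELEMENT

for chars in (10, 20):
    n, list_bytes, array_bytes = estimate(chars)
    print(chars, n, list_bytes / FILE_BYTES, array_bytes / FILE_BYTES)
```

Under these assumptions the buffer list comes out at roughly 1.2 to 2.4 times the file size, which is why a factor above 6 looks surprising.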
But this also means copying of the final array would still have a relatively
low memory footprint compared to the buffer list, thus using some kind of
mutable array type for reading should be a reasonable solution as well.
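
One possible shape for such a mutable array type is a buffer that doubles its capacity when full and is trimmed once at the end. This is only a sketch with a made-up class name (GrowableRows); it is not Chris' accumulator class and not what loadtxt does.

```python
import numpy as np

class GrowableRows:
    """Row buffer that doubles its capacity whenever it runs full."""

    def __init__(self, ncols, dtype=np.float64, capacity=16):
        self._buf = np.empty((capacity, ncols), dtype=dtype)
        self._n = 0  # number of rows actually filled

    def append(self, row):
        if self._n == self._buf.shape[0]:
            # Grow geometrically so the total copying cost stays linear.
            new = np.empty((2 * self._buf.shape[0], self._buf.shape[1]),
                           dtype=self._buf.dtype)
            new[:self._n] = self._buf
            self._buf = new
        self._buf[self._n] = row
        self._n += 1

    def toarray(self):
        # One final copy trims the unused tail of the buffer.
        return self._buf[:self._n].copy()

acc = GrowableRows(ncols=2, capacity=2)
for i in range(5):
    acc.append((i, i * 0.5))
result = acc.toarray()
print(result.shape)
```

Peak memory here is bounded by the old and new buffers during a resize, i.e. a small multiple of the final array size rather than of the list size.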
Unfortunately fromiter is not of that much use here since it only reads
1D-arrays. I haven't tried to use Chris' accumulator class yet, so for now
I went with the 2x read approach for loadtxt, which turned out to add only ~10%
to the read-in time. For compressed files this goes up to 30-50%, but
once physical memory is exhausted it should probably actually become
I've made a pull request
implementing that option as a switch 'prescan'; could you review it in
particular regarding the following:
* Is the option reasonably named and documented?
* In case the allocated array does not match the input data (which
  really should never happen), right now only a warning is issued,
  filling any excess buffer with zeros or discarding remaining input data -
  should this rather raise an IndexError?
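
To illustrate the two behaviours in question, here is a small stand-alone sketch. The function fill_preallocated and its strict flag are hypothetical and exist only for this example; they are not part of loadtxt or of the pull request.

```python
import warnings
import numpy as np

def fill_preallocated(rows, nrows, ncols, strict=False):
    """Copy rows into a zero-initialised (nrows, ncols) array.

    On a size mismatch, either warn (strict=False) or raise IndexError.
    """
    out = np.zeros((nrows, ncols))
    n = 0
    for row in rows:
        if n == nrows:
            msg = "input has more rows than the %d that were allocated" % nrows
            if strict:
                raise IndexError(msg)
            warnings.warn(msg + "; discarding the remainder")
            break
        out[n] = row
        n += 1
    if n < nrows:
        msg = "input has only %d of the %d allocated rows" % (n, nrows)
        if strict:
            raise IndexError(msg)
        warnings.warn(msg + "; the excess buffer stays zero")
    return out

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    padded = fill_preallocated([[1.0, 2.0]], nrows=2, ncols=2)
print(padded)
```

The warning variant always returns an array, but zero-filled rows are indistinguishable from real zeros in the data, which is the main argument for raising instead.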
No prediction if/when I might be able to provide this for genfromtxt, sorry!