[Numpy-discussion] Efficient way to load a 1Gb file?
Wed Aug 10 15:01:37 CDT 2011
There was also some work on a semi-mutable array type that allowed
appending along one axis, then 'freezing' to yield a normal numpy
array (unfortunately I'm not sure how to find it in the mailing list
archives). One could write such a setup by hand, using mmap() or
realloc(), but I'd be inclined to simply write a filter that converted
the text file to some sort of binary file on the fly, value by value.
Then the file can be loaded in or mmap()ed. A 1 GB text file is a
miserable object anyway, so it might be desirable to convert to (say)
HDF5 and then throw away the text file.
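
A minimal sketch of the "filter to binary, then mmap" idea, assuming a completely regular table of floats. The helper names text_to_binary and open_binary are hypothetical, and raw native-endian float64 is just one possible "some sort of binary file" (HDF5 would need an extra library such as h5py or PyTables):

```python
import numpy as np
from array import array


def text_to_binary(text_path, bin_path, delimiter=None):
    """Stream a regular text table into raw float64 values, one row at a time.

    Hypothetical helper. Returns (n_rows, n_cols), which is needed later
    because a raw binary file carries no shape information.
    """
    n_rows = 0
    n_cols = None
    with open(text_path, "r") as src, open(bin_path, "wb") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            fields = line.split(delimiter)
            if n_cols is None:
                n_cols = len(fields)
            elif len(fields) != n_cols:
                raise ValueError("ragged row at data row %d" % (n_rows + 1))
            # array('d') stores C doubles; tofile() writes them in native byte order
            array("d", [float(x) for x in fields]).tofile(dst)
            n_rows += 1
    return n_rows, (n_cols or 0)


def open_binary(bin_path, shape):
    """Map the converted file read-only instead of loading it into RAM."""
    return np.memmap(bin_path, dtype=np.float64, mode="r", shape=shape)
```

Only one line of text is held in memory during conversion, so peak usage should stay small regardless of file size. The shape returned by text_to_binary has to be kept somewhere (or recomputed from the file size and column count) to reopen the file later.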
On 10 August 2011 15:43, Derek Homeier wrote:
> On 10 Aug 2011, at 19:22, Russell E. Owen wrote:
>> A coworker is trying to load a 1 GB text data file into a numpy array
>> using numpy.loadtxt, but he says it is using up all of his machine's 6 GB
>> of RAM. Is there a more efficient way to read such text data files?
> The npyio routines (loadtxt as well as genfromtxt) first read the entire data set into lists, which of course creates significant overhead. This is not easy to circumvent: numpy arrays cannot be resized in place, so the numbers first have to be stored in some kind of growable object. One could write a custom parser that tries to be somewhat more efficient, e.g. by first reading sub-arrays from a smaller buffer. Concatenating those sub-arrays would still require about twice the memory of the final array. I don't know if using the array.array type (which can grow) is much more efficient than a list...
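
One way to try the array.array idea is sketched below, assuming a regular file of floats. The function name load_via_array is hypothetical. An array('d') stores each value as a raw 8-byte C double rather than as a separate Python float object referenced from a list, and np.frombuffer can wrap the finished buffer without copying it:

```python
import numpy as np
from array import array


def load_via_array(fname, delimiter=None):
    """Accumulate values in a growable array('d'), then view it as a 2-D ndarray.

    Hypothetical helper; assumes every non-blank line has the same number
    of numeric fields.
    """
    buf = array("d")  # growable buffer of C doubles
    n_cols = None
    with open(fname, "r") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            fields = line.split(delimiter)
            if n_cols is None:
                n_cols = len(fields)
            elif len(fields) != n_cols:
                raise ValueError("inconsistent number of columns")
            buf.extend(float(x) for x in fields)
    if n_cols is None:
        return np.empty((0, 0), dtype=np.float64)
    # frombuffer shares memory with buf, so no second copy of the data is made
    return np.frombuffer(buf, dtype=np.float64).reshape(-1, n_cols)
```

Whether this beats a list by a useful margin on a given machine would have to be measured; the buffer may still be reallocated as it grows.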
> To really avoid any excess memory usage you'd have to know the total data size in advance - either by reading in the file in a first pass to count the rows, or explicitly specifying it to a custom reader. Basically, assuming a completely regular file without missing values etc., you could then read in the data like
> import numpy as np
> # n_lines, n_columns and fname must be known in advance
> X = np.zeros((n_lines, n_columns), dtype=float)
> delimiter = ' '
> with open(fname, 'r') as f:
>     for n, line in enumerate(f):
>         X[n] = np.array(line.split(delimiter), dtype=float)
> (adjust delimiter and dtype as needed...)
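
The two-pass variant described above (count the rows first, then fill a preallocated array) might look like the following sketch. The function name load_two_pass is hypothetical, and it assumes a completely regular file with no missing values:

```python
import numpy as np


def load_two_pass(fname, delimiter=None, dtype=float):
    """Read a regular text table using only the memory of the final array.

    Hypothetical helper: pass 1 counts rows and columns, pass 2 fills a
    preallocated array row by row.
    """
    # Pass 1: count non-blank lines and take the column count from the first one
    n_lines = 0
    n_columns = 0
    with open(fname, "r") as f:
        for line in f:
            if line.strip():
                if n_lines == 0:
                    n_columns = len(line.split(delimiter))
                n_lines += 1

    X = np.empty((n_lines, n_columns), dtype=dtype)

    # Pass 2: parse each line straight into its row of X
    with open(fname, "r") as f:
        n = 0
        for line in f:
            if line.strip():
                # raises ValueError if a row has the wrong number of fields
                X[n] = [float(x) for x in line.split(delimiter)]
                n += 1
    return X
```

The file is read twice, trading extra I/O time for not holding any intermediate list of the whole data set.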