[Numpy-discussion] Efficient way to load a 1Gb file?
Wed Aug 10 14:43:47 CDT 2011
On 10 Aug 2011, at 19:22, Russell E. Owen wrote:
> A coworker is trying to load a 1Gb text data file into a numpy array
> using numpy.loadtxt, but he says it is using up all of his machine's 6Gb
> of RAM. Is there a more efficient way to read such text data files?
The npyio routines (loadtxt as well as genfromtxt) first read the entire data set into Python lists. That creates significant overhead, since every value is stored as a separate Python object, but it is not easy to circumvent: numpy arrays are mutable but cannot grow in place, so the numbers first have to be collected in some kind of resizable object. One could write a custom parser that tries to be somewhat more efficient, e.g. by first reading sub-arrays from a smaller buffer. Concatenating those sub-arrays would still require about twice the memory of the final array. I don't know if using the array.array type (which is resizable) is much more efficient than a list...
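As a rough illustration of the array.array idea, here is a minimal sketch. It is untested on large files; the helper name load_via_array and its parameters are made up for this example and are not part of numpy. It assumes a regular file with a known number of columns and no missing values. The array.array stores raw 8-byte doubles rather than Python float objects, and np.frombuffer then wraps that memory without copying it. The buffer still over-allocates somewhat while it grows, so peak usage will be above the final array size.

```python
import array
import numpy as np

def load_via_array(fname, n_columns, delimiter=None):
    """Hypothetical helper: collect values in a compact array.array of doubles."""
    buf = array.array('d')  # raw C doubles, no per-value Python objects
    with open(fname, 'r') as f:
        for line in f:
            if line.strip():  # skip blank lines
                # delimiter=None splits on any run of whitespace
                buf.extend(float(v) for v in line.split(delimiter))
    # frombuffer shares memory with buf instead of copying it
    return np.frombuffer(buf, dtype=float).reshape(-1, n_columns)
```

Calling load_via_array(fname, n_columns) returns a 2-D float array with one row per non-empty line.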
To really avoid any excess memory usage you'd have to know the total data size in advance, either by reading the file in a first pass to count the rows, or by explicitly specifying it to a custom reader. Basically, assuming a completely regular file without missing values etc., you could then read in the data like this:
import numpy as np
# fname, n_lines and n_columns must be known in advance
X = np.zeros((n_lines, n_columns), dtype=float)
delimiter = ' '
with open(fname, 'r') as f:
    for n, line in enumerate(f):
        X[n] = np.array(line.split(delimiter), dtype=float)
(adjust delimiter and dtype as needed...)
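If the row count is not known beforehand, the first-pass variant mentioned above could look roughly like this. It is only a sketch: load_two_pass is a made-up name, and it assumes every non-empty line has the same number of columns as the first one. It trades a second read of the file for not holding any intermediate copy of the data.

```python
import numpy as np

def load_two_pass(fname, delimiter=None, dtype=float):
    """Hypothetical helper: count rows first, then fill a preallocated array."""
    # Pass 1: count non-empty lines; take the column count from the first one.
    n_lines, n_columns = 0, 0
    with open(fname, 'r') as f:
        for line in f:
            if line.strip():
                if n_lines == 0:
                    n_columns = len(line.split(delimiter))
                n_lines += 1
    # Pass 2: parse each line directly into its row of the final array.
    X = np.empty((n_lines, n_columns), dtype=dtype)
    with open(fname, 'r') as f:
        n = 0
        for line in f:
            if line.strip():
                X[n] = [float(v) for v in line.split(delimiter)]
                n += 1
    return X
```

Only one line's worth of temporary Python objects exists at any time, so memory use should stay close to the size of X itself.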