[Numpy-discussion] Efficient way to load a 1Gb file?
Thu Aug 11 23:49:18 CDT 2011
On 8/10/2011 1:01 PM, Anne Archibald wrote:
> There was also some work on a semi-mutable array type that allowed
> appending along one axis, then 'freezing' to yield a normal numpy
> array (unfortunately I'm not sure how to find it in the mailing list
That was me, and here is the thread -- however, I'm on vacation, and
don't have the test code, etc with me, but I found the core class. It's
>> The npyio routines (loadtxt as well as genfromtxt) first read in the entire data as lists, which creates of course significant overhead, but is not easy to circumvent, since numpy arrays are immutable - so you have to first store the numbers in some kind of mutable object. One could write a custom parser that tries to be somewhat more efficient, e.g. first reading in sub-arrays from a smaller buffer. Concatenating those sub-arrays would still require about twice the memory of the final array. I don't know if using the array.array type (which is mutable) is much more efficient than a list...
Indeed, and they are holding all the text as well, which is generally going
to be bigger than the resulting numbers.
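On the array.array question in the quoted text, here is a rough way to compare the two containers. It is a minimal sketch that assumes CPython (sys.getsizeof is implementation-specific), and the helper names list_bytes and array_bytes are made up for illustration:

```python
import sys
from array import array


def list_bytes(values):
    """Approximate bytes held by a list of Python floats: container plus objects."""
    return sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)


def array_bytes(values):
    """Bytes held by an array.array: on CPython this includes the packed data."""
    return sys.getsizeof(values)


n = 100000
as_list = [float(i) for i in range(n)]
as_array = array('d', as_list)  # 'd' stores C doubles, packed contiguously

list_total = list_bytes(as_list)
array_total = array_bytes(as_array)

print("list of floats: %d bytes" % list_total)
print("array.array   : %d bytes" % array_total)
```

On a typical 64-bit CPython the list should come out several times larger, since each float is a separate object plus a pointer in the list, while array.array stores 8 bytes per value. This says nothing about parsing speed, only about the storage overhead.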
Interesting, when I wrote accumulator, I found that it didn't, for the
most part, have any performance advantage over accumulating on lists,
then converting to arrays -- but there is a memory advantage, so this
may be a good use case. You could do something like (untested):
If your rows are all one dtype:
X = accumulator(dtype=np.float32, block_shape=(num_columns,))
If they are not, then build a custom dtype to hold the rows, and use that:
dt = np.dtype('%id'%num_columns) # create a dtype that holds a row
#num_columns doubles in this case.
# create an accumulator for that dtype
X = accumulator(dtype=dt)
# loop through the file to build the array:
delimiter = ' '
with open(fname, 'r') as infile:
    for line in infile:
        X.append(np.array(line.split(delimiter), dtype=float))
X = np.array(X) # gives a regular old array as a copy
I note that converting to a regular array requires a data copy, which,
if memory is tight, might not be good. The solution would be to have a
way to make a view, so you'd get a regular array from the same data
(with maybe the extra buffer space).
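To make the view idea concrete, here is a minimal sketch of a growable row buffer that hands back a view of its filled part. It is NOT the accumulator class discussed in this thread; GrowableRows, its parameters, and the doubling strategy are all assumptions made for illustration:

```python
import numpy as np


class GrowableRows:
    """Hypothetical append-only row buffer (not the accumulator class from this thread)."""

    def __init__(self, num_cols, dtype=np.float64, initial_rows=16):
        # initial_rows must be at least 1 so that doubling can grow the buffer.
        self._buffer = np.empty((initial_rows, num_cols), dtype=dtype)
        self._n = 0  # number of rows actually filled

    def append(self, row):
        if self._n == self._buffer.shape[0]:
            # Out of room: allocate double the capacity and copy the old rows over.
            bigger = np.empty((2 * self._n, self._buffer.shape[1]),
                              dtype=self._buffer.dtype)
            bigger[:self._n] = self._buffer
            self._buffer = bigger
        self._buffer[self._n] = row
        self._n += 1

    def view(self):
        """Return the filled rows as a basic slice, which is a view (no copy)."""
        return self._buffer[:self._n]


rows = GrowableRows(num_cols=3)
for line in ["1 2 3", "4 5 6", "7 8 9"]:
    rows.append([float(tok) for tok in line.split()])

as_view = rows.view()            # shares memory with the internal buffer
as_copy = np.array(rows.view())  # independent copy, like np.array(X) above
```

The trade-off is the one mentioned above: the view keeps the whole buffer alive, unused tail included. Also, in this sketch a later append that triggers a reallocation leaves earlier views pointing at the old buffer, so you would only take the view once loading is finished.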
I'd like to see this class get more mature, robust, and better
performing, but so far it's worked for my use cases. Contributions welcome.
Christopher Barker, Ph.D.
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception